JupyterHub CrashLoopBackOff #493

Closed
kdubovikov opened this issue Feb 10, 2018 · 15 comments · Fixed by #1422

@kdubovikov commented Feb 10, 2018

I am trying to spin up JupyterHub using Helm. All resources start successfully, but after a short time the hub pod enters CrashLoopBackOff.

Installation was performed using the following command:

helm install jupyterhub/jupyterhub --version=v0.6 --name=jupyterhub --namespace jupyterhub -f ./jupyterhub/config.yaml --timeout=1000000

I've also tested version 0.5 and got the same results.

Logs:

$ kubectl logs po/hub-56d985bfb8-vb6pl --namespace jupyterhub
[I 2018-02-10 08:21:27.439 JupyterHub app:830] Loading cookie_secret from env[JPY_COOKIE_SECRET]
[W 2018-02-10 08:21:27.673 JupyterHub app:955] No admin users, admin interface will be unavailable.
[W 2018-02-10 08:21:27.673 JupyterHub app:956] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2018-02-10 08:21:27.673 JupyterHub app:983] Not using whitelist. Any authenticated user will be allowed.
[I 2018-02-10 08:21:28.025 JupyterHub app:1528] Hub API listening on http://0.0.0.0:8081/hub/
[I 2018-02-10 08:21:28.026 JupyterHub app:1538] Not starting proxy
[I 2018-02-10 08:21:28.026 JupyterHub app:1544] Starting managed service cull-idle
[I 2018-02-10 08:21:28.026 JupyterHub service:266] Starting service 'cull-idle': ['/usr/local/bin/cull_idle_servers.py', '--timeout=3600', '--cull-every=600', '--url=http://127.0.0.1:8081/hub/api']
[I 2018-02-10 08:21:28.053 JupyterHub service:109] Spawning /usr/local/bin/cull_idle_servers.py --timeout=3600 --cull-every=600 --url=http://127.0.0.1:8081/hub/api
[I 2018-02-10 08:21:28.263 JupyterHub log:122] 200 GET /hub/api/users (cull-idle@127.0.0.1) 25.95ms
[E 2018-02-10 08:21:48.064 JupyterHub app:1623]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1621, in launch_instance_async
        yield self.start()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1569, in start
        yield self.proxy.check_routes(self.users, self._service_map)
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 294, in check_routes
        routes = yield self.get_all_routes()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 589, in get_all_routes
        resp = yield self.api_request('', client=client)
    tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20000 milliseconds

Namespace status:

kubectl get all --namespace dljupyterhub   

NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/hub     1         1         1            0           22m
deploy/proxy   1         1         1            1           22m

NAME                  DESIRED   CURRENT   READY     AGE
rs/hub-5479595c8d     1         1         0         22m
rs/proxy-6fbf784dbd   1         1         1         22m

NAME                        READY     STATUS             RESTARTS   AGE
po/hub-5479595c8d-7qhzb     0/1       CrashLoopBackOff   7          22m
po/proxy-6fbf784dbd-pt5q6   2/2       Running            0          22m

NAME                               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
svc/glusterfs-dynamic-hub-db-dir   ClusterIP      10.104.97.71    <none>        1/TCP                        15m
svc/hub                            ClusterIP      10.106.231.99   <none>        8081/TCP                     22m
svc/proxy-api                      ClusterIP      10.104.175.74   <none>        8001/TCP                     22m
svc/proxy-http                     ClusterIP      10.106.46.200   <none>        8000/TCP                     22m
svc/proxy-public                   LoadBalancer   10.109.72.153   <pending>     80:32500/TCP,443:31790/TCP   22m

Contents of config.yaml

hub:
  cookieSecret: "aaa"
proxy:
  secretToken: "bbb"
singleuser:
  storage:
    capacity: 2Gi
    dynamic:
      storageClass: gluster-heketi
ingress:
  enabled: true
  hosts:
    - host1
@yuvipanda (Collaborator) commented:

Heya @kdubovikov! Thanks for filing this issue!

It looks like the hub pod cannot reach the proxy pod. Are pod networking and kube-proxy working properly? I suspect this is an OpenStack or bare-metal setup. Are other services on the cluster working fine? Does https://scanner.heptio.com/ find any issues?
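
A couple of quick checks along these lines (a sketch; the kube-proxy label and the busybox image tag are assumptions, adjust for your cluster):

kubectl get pods --namespace kube-system -l k8s-app=kube-proxy
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default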

@kdubovikov (Author) commented:

Hey @yuvipanda, thanks for the response. All other services are working fine (we also run GlusterFS). I've run the tests and no issues were found:

Ran 125 of 710 Specs in 3156.548 seconds
SUCCESS! -- 125 Passed | 0 Failed | 0 Pending | 585 Skipped PASS

Also, I am able to run JupyterHub with KubeSpawner outside of the cluster without issues.

@yuvipanda (Collaborator) commented:

Hmm, in that case I'm at a loss about what is going on :(

@willingc (Collaborator) commented:

Ping @minrk. Any thoughts?

Are you still seeing this issue @kdubovikov?

@minrk (Member) commented Feb 28, 2018

It does seem like a networking problem, but I'm not sure what the best way to debug it would be. You could edit the hub command to run while true; do sleep 10; done, then kubectl exec into the hub pod and see whether you can communicate with the proxy via curl or similar.

You could also try communicating with the proxy from another context (e.g. outside the cluster, another pod, etc.) to be sure that the proxy pod is accepting connections.
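
For example, you could port-forward the proxy pod's API port to your workstation and hit it from outside the cluster (a sketch; the pod and namespace names are placeholders, and without the auth token the proxy API is expected to answer 403, which still proves it accepts connections):

kubectl port-forward <proxy-pod> 8001:8001 --namespace <namespace>
curl -i http://localhost:8001/api/routes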

Do you have any NetworkPolicy config on the cluster?
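
NetworkPolicy objects are namespaced, so a quick way to check everywhere (a sketch):

kubectl get networkpolicies --all-namespaces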

@kdubovikov (Author) commented:

@minrk, I think no NetworkPolicy is present. The cluster was set up using kubeadm. Could you clarify where I need to change the hub command?

@minrk (Member) commented Mar 2, 2018

You can edit the jupyterhub command with:

kubectl edit deployment hub

and change the command that looks like:

      - command:
        - jupyterhub
        - --config
        - /srv/jupyterhub_config.py
        - --upgrade-db

to

      - command:
        - sh
        - -c
        - while true; do sleep 10; done

This will create a new hub pod with the new command, which you can kubectl exec -it into.
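
From inside that pod you can then probe the proxy API service directly. A rough sketch (the service name and port come from the kubectl get all output above; the image may lack curl, so this uses python3 with requests, assuming it is available; an immediate response such as 403 without a token means the network path works, while a hang ending in a timeout reproduces the HTTP 599 from the hub logs):

kubectl exec -it <hub-pod> --namespace <namespace> -- /bin/bash
# inside the pod:
python3 -c "import requests; print(requests.get('http://proxy-api:8001/api/routes', timeout=5).status_code)"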

@yuandongfang commented:

Did you find a solution? I also have this problem; can anyone help? Thanks to all of you. Here is the kubectl describe output for my hub pod:

Name:           hub-86d676cf88-jw8ws
Namespace:      jupyterhubtest
Node:           192.168.0.5/192.168.0.5
Start Time:     Tue, 10 Apr 2018 11:34:12 +0800
Labels:         app=jupyterhub
                component=hub
                heritage=Tiller
                name=hub
                pod-template-hash=4282327944
                release=jupyterhubfork8s
Status:         Running
IP:             172.18.0.26
Controllers:    ReplicaSet/hub-86d676cf88
Containers:
  hub-container:
    Container ID:   docker://550c1ae33d73c965a87a50bd87f2b87fcafa498f3b4a7e59b807828ef15cea63
    Image:          jupyterhub/k8s-hub:4b122ad
    Image ID:       docker-pullable://jupyterhub/k8s-hub@sha256:b1fb9dd9eec9a9aab583addd8f03fd035494681ac224cfaa55126de442eeecd3
    Port:           8081/TCP
    Command:
      jupyterhub
      --config
      /srv/jupyterhub_config.py
    Requests:
      cpu:     200m
      memory:  512Mi
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 10 Apr 2018 12:06:03 +0800
      Finished:     Tue, 10 Apr 2018 12:06:04 +0800
    Ready:          False
    Restart Count:  11
    Volume Mounts:
      /etc/jupyterhub/config/ from config (rw)
      /etc/jupyterhub/secret/ from secret (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hub-token-lwmrd (ro)
    Environment Variables:
      SINGLEUSER_IMAGE:        jupyterhub/k8s-singleuser-sample:5d060de
      JPY_COOKIE_SECRET:       <set to the key 'hub.cookie-secret' in secret 'hub-secret'>
      POD_NAMESPACE:           jupyterhubtest (v1:metadata.namespace)
      CONFIGPROXY_AUTH_TOKEN:  <set to the key 'proxy.token' in secret 'hub-secret'>
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  config:
    Type:  ConfigMap (a volume populated by a ConfigMap)
    Name:  hub-config
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-secret
  hub-token-lwmrd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-token-lwmrd
QoS Class:      Burstable
Tolerations:
Events:
  FirstSeen LastSeen Count From SubObjectPath Type Reason Message
  32m 32m 1 {default-scheduler } Normal Scheduled Successfully assigned hub-86d676cf88-jw8ws to 192.168.0.5
  32m 32m 1 {kubelet 192.168.0.5} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "config"
  32m 32m 1 {kubelet 192.168.0.5} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "secret"
  32m 32m 1 {kubelet 192.168.0.5} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "hub-token-lwmrd"
  32m 32m 1 {kubelet 192.168.0.5} spec.containers{hub-container} Normal Pulling pulling image "jupyterhub/k8s-hub:4b122ad"
  32m 32m 1 {kubelet 192.168.0.5} spec.containers{hub-container} Normal Pulled Successfully pulled image "jupyterhub/k8s-hub:4b122ad"
  32m 31m 4 {kubelet 192.168.0.5} spec.containers{hub-container} Normal Created Created container
  32m 31m 4 {kubelet 192.168.0.5} spec.containers{hub-container} Normal Started Started container
  32m 31m 3 {kubelet 192.168.0.5} spec.containers{hub-container} Normal Pulled Container image "jupyterhub/k8s-hub:4b122ad" already present on machine
  32m 17m 67 {kubelet 192.168.0.5} spec.containers{hub-container} Warning BackOff Back-off restarting failed container
  32m 2m 135 {kubelet 192.168.0.5} Warning FailedSync Error syncing pod

@ryanlovett (Collaborator) commented:

I'm seeing this on GKE. We were running v0.6 and tried to upgrade to the latest chart. After some helm failures I reverted to v0.6 but ran into this. I've tried deleting the pods and deployments. I'll do some debugging.

@ryanlovett (Collaborator) commented Jul 9, 2018

There's no curl or wget in the pod. With python3 and requests I can confirm the tornado.curl_httpclient.CurlError: the proxy-api endpoint times out, while proxy-public and proxy-http are responsive.

The cluster has:

addonsConfig:
  networkPolicyConfig:
    disabled: true
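
Something else worth checking at this point (a sketch; <namespace> is a placeholder): if the proxy-api Service's selector no longer matches the running proxy pod, the Service will have no endpoints and requests to it will fail even though the proxy itself is healthy.

kubectl describe service proxy-api --namespace <namespace>
kubectl get endpoints proxy-api --namespace <namespace>
kubectl get pods --namespace <namespace> --show-labels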

@ryanlovett (Collaborator) commented:

The proxy-api object was referencing a newer version of the helm chart -- one that I had previously tried to upgrade to. I deleted the proxy-api object, then reran my CI to do a helm upgrade and now everything is working.
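
For reference, a sketch of that recovery (release, namespace, and chart version are placeholders; in this case the upgrade was driven by CI):

kubectl delete service proxy-api --namespace <namespace>
helm upgrade <release-name> jupyterhub/jupyterhub --version <chart-version> -f config.yaml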

@ryanlovett (Collaborator) commented:

I ran into this again on Azure after a helm upgrade. Unlike last time, I couldn't access any of the service endpoints. I have a feeling this occurrence is due to the infrastructure and not z2jh, but I thought I'd leave a trail marker.

@consideRatio (Member) commented:

Hmmm, @ryanlovett wrote:

The proxy-api object was referencing a newer version of the helm chart -- one that I had previously tried to upgrade to. I deleted the proxy-api object, then reran my CI to do a helm upgrade and now everything is working.

Does this mean that our proxy pod did not restart as it should have, or that it persisted some faulty state that needed to be refreshed? Any ideas on what state was outdated?

@ryanlovett, we have now released 0.7.0; any feedback on upgrading to it would be very relevant. If you do, just make sure to follow the upgrade instructions in the changelog.md file.

@diegodorgam commented:

Any thoughts on this?

@consideRatio (Member) commented Sep 30, 2019

I found that these errors happen when the hub and proxy get updated at the same time. The hub will crash if it fails to communicate with the proxy, but it only recognizes the failure about 20 seconds later, and by that time the hub can appear functional. The next time we bump the JupyterHub version we will get jupyterhub/jupyterhub#2750, which will keep the hub pod reported as unavailable until it actually works reliably.

Perhaps we bump it along with #1422, or earlier.
