Recreate as CHP proxy pod's deployment strategy #1401
Conversation
Nice catch! Looks good to me modulo the two nits.
Knock knock
Race condition
Who's there?
😂
Using a rolling update by default on the proxy pod is a mistake on our part, because of the JupyterHub / CHP proxy interaction. JupyterHub assumes in check_routes / add_route etc. that it is speaking to one specific CHP proxy server, but different servers can respond if we make an upgrade while the proxy pod is doing a rolling update. For example, consider a hub pod doing a recreate upgrade and a proxy pod doing a rolling upgrade. The new hub pod could become ready before the proxy pod, start speaking with the old proxy pod, and then at a crucial point switch to speaking with the new pod. If that switch happens at the wrong time, the hub may fail to get responses about user pods it has just verified to be around, and those pods are then deleted. So, this commit hopes to fix a sneaky bug where user pods are deleted during upgrades in which the proxy pod is also updated!
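As the PR title suggests, the fix is to set the proxy Deployment's strategy to Recreate, so that only one CHP instance ever exists at a time. A minimal sketch of what that looks like in a Kubernetes Deployment manifest (the names, labels, and image tag here are illustrative, not the chart's actual templates):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: proxy            # illustrative name, not the chart's template
spec:
  replicas: 1
  strategy:
    type: Recreate       # terminate the old CHP pod before starting the
                         # new one, so the hub never sees two proxies
  selector:
    matchLabels:
      component: proxy
  template:
    metadata:
      labels:
        component: proxy
    spec:
      containers:
        - name: chp
          image: jupyterhub/configurable-http-proxy:4.5.3   # example tag
```

The trade-off is a short window of proxy downtime during the upgrade, which is what causes the transient errors mentioned below.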
902a032 to 4bb76c7
Upgrades from the previous state to this one would fail without this fix for the issue caused by removing the fix in jupyterhub#1401.
Bugfix for proxy upgrade strategy PR #1401
After this change, it is typical to see these kinds of errors from the hub during startup until the proxy becomes ready again.
Hi guys, have you solved this question? I get tornado.curl_httpclient.CurlError: HTTP 599: Connection timed out after 20001 milliseconds. I have met the same problem, and it has bothered me for several days. Please give me some suggestions. Waiting for a response.
Yes, use the latest version, 0.10.6, of the Helm chart.
Note that with the Traefik proxy, which would store its state in a key-value store, this may not be a problem, but we don't yet use the Traefik proxy.