
Proxy startup and configuration is required for init_spawners, right? #2749

Closed
consideRatio opened this issue Sep 26, 2019 · 9 comments

@consideRatio
Member

consideRatio commented Sep 26, 2019

I had a faulty assumption; this is not an issue.


If the hub starts up with state about running user pods, their individual Spawner objects will be initialized again (init_spawners) during JupyterHub startup. As part of this, they are probed for a sign of life, and if they fail to respond they are deleted. This user lifesign probe relies on the proxy being available and configured. And here is the crux: can we be confident we have configured the proxy by then? I don't think so. That happens in check_routes, which is called after init_spawners, unless a configurable amount of time passes, in which case it is called earlier, at the end of the start function...

That was about configuring the proxy before the init_spawners verifications run, but what if the proxy isn't even started? Well, then it's good that it gets started by the start function, which can run after the configurable timeout is reached...

The JupyterHub startup phase

Related

Issue about users being deleted when they shouldn't have been: jupyterhub/zero-to-jupyterhub-k8s#1370

@consideRatio consideRatio changed the title Proxy startup and configuration before init_spawners - right? Proxy startup and configuration is required for init_spawners, right? Sep 26, 2019
@consideRatio
Member Author

After #2750 the structure described here will change for the better, but I think there is still something to consider: what dynamics are caused by the init_spawners_timeout duration in conjunction with the patience of the checks triggered by init_spawners, which I think is represented by http_timeout in the Spawner base class?

@minrk
Member

minrk commented Sep 26, 2019

This is the key misunderstanding:

This user lifesign probe relies on the proxy being available and configured

The Hub probing servers in init_spawners does not involve the proxy at all.

init_spawners exclusively verifies the Hub's internal state about which spawners are running and where. Only after init_spawners completes is check_routes called, which reconciles the internal state of the Hub with the proxy.

The startup phase:

  • await initialize
    • await init_spawners (initializes spawner state, proxy is not consulted)
  • await start
    • await check_routes <- this is the only time the proxy is involved, where hub state is reconciled with the proxy

In JupyterHub 1.0, init_spawners is guaranteed to be complete before the proxy is consulted (or even started, in the default case where the Hub starts the proxy).

#2721 complicated this a couple days ago by allowing init_spawners to be incomplete when that first check_routes is called. Any Spawners that are still waiting on a check are in a 'pending' state, which means whatever their state in the proxy is, will not be modified. To deal with this, check_routes is called again as soon as init_spawners completes.
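The ordering and the pending-check behavior described above can be sketched as a toy model. This is not JupyterHub's actual code; only the names init_spawners, check_routes, and init_spawners_timeout are borrowed, and the bodies are stand-ins:

```python
import asyncio

events = []  # record the order things happen in

async def init_spawners():
    events.append("init_spawners:start")
    await asyncio.sleep(0.2)          # stand-in for probing user servers
    events.append("init_spawners:done")

async def check_routes():
    events.append("check_routes")     # stand-in: reconcile hub state with proxy

async def startup(init_spawners_timeout=0.05):
    init_task = asyncio.ensure_future(init_spawners())
    # wait at most init_spawners_timeout before proceeding to start()
    done, pending = await asyncio.wait({init_task}, timeout=init_spawners_timeout)
    await check_routes()              # first call; pending spawners left alone
    if pending:
        await init_task               # let the remaining checks finish...
        await check_routes()          # ...then reconcile with the proxy again

asyncio.run(startup())
```

When init_spawners finishes within the timeout, `pending` is empty and check_routes runs only once; otherwise the first check_routes leaves the pending spawners' routes untouched and a second reconciliation happens once the checks complete.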

[That] was about configuring the proxy before the init_spawners verifications run

where was this? It doesn't really make sense to do that.

@consideRatio
Member Author

init_spawners exclusively verifies the Hub's internal state about which spawners are running and where.

I believe that init_spawners awaits many calls to check_spawner which will delete users that fail to respond properly.


check_spawner requires the proxy to be configured, as it runs checks against the user servers and kills them if they fail to respond; if there is no routing available for these checks to succeed, that's a problem. This can be seen in the code below.

jupyterhub/jupyterhub/app.py

Lines 1940 to 1951 in 5b13f96

self.log.debug(
    "Verifying that %s is running at %s", spawner._log_name, url
)
try:
    await user._wait_up(spawner)
except TimeoutError:
    self.log.error(
        "%s does not appear to be running at %s, shutting it down.",
        spawner._log_name,
        url,
    )
    status = -1

The calls to check_spawner that depend on the proxy being configured are awaited within init_spawners here.

jupyterhub/jupyterhub/app.py

Lines 1974 to 2000 in 5b13f96

# parallelize checks for running Spawners
check_futures = []
for orm_user in db.query(orm.User):
    user = self.users[orm_user]
    self.log.debug("Loading state for %s from db", user.name)
    for name, orm_spawner in user.orm_spawners.items():
        if orm_spawner.server is not None:
            # spawner should be running
            # instantiate Spawner wrapper and check if it's still alive
            spawner = user.spawners[name]
            # signal that check is pending to avoid race conditions
            spawner._check_pending = True
            f = asyncio.ensure_future(check_spawner(user, name, spawner))
            check_futures.append(f)
TOTAL_USERS.set(len(self.users))
# it's important that we get here before the first await
# so that we know all spawners are instantiated and in the check-pending state
# await checks after submitting them all
if check_futures:
    self.log.debug(
        "Awaiting checks for %i possibly-running spawners", len(check_futures)
    )
    await gen.multi(check_futures)
db.commit()


So, if for example init_spawners were awaited on a system where both the proxy and the hub restart at the same time, the proxy would have lost its state, and the hub would remember users, initialize them, later kill them, and only then configure the proxy.

@minrk
Member

minrk commented Sep 26, 2019

I believe that init_spawners awaits many calls to check_spawner which will delete users that fail to respond properly.

Yes, and rightly so. The proxy is irrelevant, though, because the Hub always talks to spawners directly. No internal component of JupyterHub ever communicates via the proxy. Thinking about what the check is for: this is the check of what servers are running, in order to determine what the proxy should do. The proxy cannot be a requirement for determining what the proxy's routes should be.

if for example init_spawners were to be awaited on a system where both the proxy and hub restarts at the same time, the proxy will have lost its state, the hub will remember users and initialize them and later kill them, and then configure the proxy.

This is what happens very often with jupyterhub upgrades and no users are deleted, because it goes like this:

  • proxy restarts, state is empty
  • hub restarts, starts polling user servers
  • (init_spawners) the user servers that respond are loaded as 'running', the ones that don't are loaded as 'stopped'
  • (check_routes) the proxy table is retrieved
  • any routes for servers that are running are added if missing in the proxy table (if the proxy also restarted, this will be everything)
  • any routes for servers that are not responsive are removed if they are present in the proxy table (this can only happen if the hub restarted and the proxy did not)

@minrk
Member

minrk commented Sep 26, 2019

Take the default jupyterhub configuration, where the Hub starts the proxy with c.JupyterHub.cleanup_servers = False: the proxy is not started until after init_spawners is complete. If this didn't work, we would have had a problem a long time ago.

@consideRatio
Member Author

@minrk ah hmmm, so the hub, from within app.py/init_spawners, invoking check_spawner, which invokes user.py/_wait_up, speaks directly to the user server and does not require any routing from the proxy?

Ah...

async def _wait_up(self, spawner):
    """Wait for a server to finish starting.

    Shuts the server down if it doesn't respond within
    spawner.http_timeout.
    """
    server = spawner.server
    key = self.settings.get('internal_ssl_key')
    cert = self.settings.get('internal_ssl_cert')
    ca = self.settings.get('internal_ssl_ca')
    ssl_context = make_ssl_context(key, cert, cafile=ca)
    try:
        resp = await server.wait_up(
            http=True, timeout=spawner.http_timeout, ssl_context=ssl_context
        )

@minrk
Member

minrk commented Sep 26, 2019

@consideRatio discussions like this sound like a great occasion for improving architecture docs! We currently have this overview, but it may not be clear who talks directly to whom and when.

This diagram, for instance, attempts to communicate that we have:

  • proxy talks to the hub and notebooks
  • notebooks talk to the hub
  • hub talks to notebooks (missing!)

Critical points in JupyterHub architecture:

  • One of the main tasks of the Hub is to ensure the Proxy is routing requests to the right places
  • The proxy is exclusively for external communication. JupyterHub never uses the proxy for internal communication.
  • The proxy being down or restarting or slow does not affect JupyterHub's internal function, except for when it is checking the proxy's state itself.

What happens when the Hub is checking if a server is alive? (occurs at startup and as the last stage of spawner start)

  • Spawner determines URL (e.g. http://host:port) where the server is running
  • Hub connects directly to this URL
  • If the Hub successfully connects to this URL, the proxy should route /user/name to this URL
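Those three steps boil down to a check-then-route decision, sketched here with hypothetical stand-ins (FakeProxy and the probe coroutines are illustrative, not JupyterHub APIs):

```python
import asyncio

class FakeProxy:
    """Stand-in for the proxy's route table."""
    def __init__(self):
        self.routes = {}
    async def add_route(self, routespec, target):
        self.routes[routespec] = target

async def add_route_if_alive(proxy, routespec, url, probe):
    # the Hub connects directly to `url`; the proxy is not consulted
    if await probe(url):
        await proxy.add_route(routespec, url)
        return True
    return False  # dead server: nothing is added to the proxy

async def main():
    proxy = FakeProxy()
    async def up(url): return True      # pretend this server responded
    async def down(url): return False   # pretend this one did not
    await add_route_if_alive(proxy, "/user/a/", "http://h:1", up)
    await add_route_if_alive(proxy, "/user/b/", "http://h:2", down)
    return proxy.routes

routes = asyncio.run(main())
```

Only the server that answered the direct probe ends up with a route, which is exactly why the proxy cannot be a prerequisite for the check.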

@minrk
Member

minrk commented Sep 26, 2019

speaks directly to the user server, and does not require any routing from the proxy?

Yes, exactly! This is the condition that is required before the Hub will add the route to the proxy. We can't require that it be in the proxy before we decide if it should be added to the proxy.

@rkdarst
Contributor

rkdarst commented Sep 27, 2019

@consideRatio discussions like this sound like a great occasion for improving architecture docs! We currently have this overview, but it may not be clear who talks directly to whom and when.

#2726 was my attempt at something more: roughly what I learned, combined with the technical overview plus everything else. At some point I could make a pass at improving the technical overview too, but how should these two pages relate?

For another "JupyterHub for sysadmins" talk I made a more detailed architecture diagram, here (https://docs.google.com/presentation/d/1Izs1EJJLqNUCqnEblatc59CznnKNz0uA-33T6g6bMf8). I've been meaning to submit it to the JH docs for a while. What do you think? (I now know of a few problems that need fixing in it...)
