Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci system following two merged PRs get stuck #1237

Closed
consideRatio opened this issue Jan 6, 2021 · 6 comments · Fixed by #1242
Closed

ci system following two merged PRs get stuck #1237

consideRatio opened this issue Jan 6, 2021 · 6 comments · Fixed by #1242
Labels
maintenance Under the hood improvements and fixes

Comments

@consideRatio
Copy link
Member

consideRatio commented Jan 6, 2021

I merged #1236 and #1235 which both seemed fine to me, but now our tests fail in a new way. Did the combination of these PRs introduce a new bug? I'm clueless.

Test failure: https://github.com/jupyterhub/binderhub/actions/runs/466459518

image


From the namespace report, it seems like the binder pod became ready and running though...

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/proxy            1/1     1            1           10m
deployment.apps/user-scheduler   2/2     2            2           10m
deployment.apps/hub              1/1     1            1           10m
deployment.apps/binder           1/1     1            1           10m

The actual test seem to be...

  . ci/common
  await_binderhub
  echo "curl http://localhost:30901/health" to check BinderHub\'s health
  curl http://localhost:30901/health

And the initial output of this test is...

deployment "proxy" successfully rolled out
deployment "hub" successfully rolled out
Waiting for deployment "binder" rollout to finish: 0 of 1 updated replicas are available...
deployment "binder" successfully rolled out
curl http://localhost:30901/health to check BinderHub's health
<THIS IS WHERE IT ALL GETS STUCK: in the "curl http://localhost:30901/health" statement>

It seems to me that the /health endpoint now fail to respond or perhaps streams a response or similar that is never completed?

@consideRatio consideRatio changed the title B ci system following two merged PRs get stuck Jan 6, 2021
@consideRatio
Copy link
Member Author

Here is a test run from #1238 that provides more details about the state of things associated with new failure. Nothing seems wrong to me, but I'm not sure what causes the curl request to get stuck.

https://github.com/jupyterhub/binderhub/pull/1238/checks?check_run_id=1657413955

@betatim
Copy link
Member

betatim commented Jan 6, 2021

I think it has happened before that the curl call gets stuck. Restarting the CI run seems to fix it. So I think this is a flakey test/setup kind of issue :-/

@consideRatio
Copy link
Member Author

No because the echoed statement is executes

@manics
Copy link
Member

manics commented Jan 6, 2021

Is there a potential race condition? Could the check in

kubectl rollout status --watch --timeout 300s deployment/binder

succeed before /health and/or the network routing is ready?

@consideRatio
Copy link
Member Author

consideRatio commented Jan 6, 2021

Hmmmm, well there are no readiness probe for the binder deployment apparently, so just attempting to start (having pulled the image etc) will make the rollout status work clear. So, yes, but only if it is slower than the hub and proxy pod, which it could be.

Okay to add a readiness probe? PR in #1242.
Okay to merge #1238? EDIT: merged, ci pr only.

@consideRatio
Copy link
Member Author

From some experimentation, I found enough evidence to support the idea that @manics guess was correct.

Without a readinessProbe, the following only concludes the binder pod has started, not that it has become ready.

kubectl rollout status --watch --timeout 300s deployment/binder

Due to this, I marked #1242 to close this issue.

@consideRatio consideRatio added maintenance Under the hood improvements and fixes and removed bug labels Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Under the hood improvements and fixes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants