Builds: check celery and redis before marking a build as stale #8269

Closed
wants to merge 4 commits from the humitos/finish-stale-builds branch

Conversation

@humitos (Member) commented Jun 16, 2021

Previously, we were only checking whether the build had been running for a very
long time (2 hours) before marking it as stale.

This new approach reduces that time to 15 minutes and adds extra checks on top of
that:

  • the build has been running for more than 15 minutes
  • the build is not being executed by Celery (celery.inspect.active)
  • the build is not queued in Redis queues

When all of these are True we mark the build as stale.

Closes #4386
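
Roughly, the check boils down to something like this (a simplified sketch of the idea only, not the code in the diff; the function name, the queue names, and the pk-matching strategy are illustrative):

from datetime import timedelta

from celery import current_app as app
from django.utils import timezone


def build_looks_stale(build, redis_client, delta=timedelta(minutes=15)):
    # Sketch only: names and matching strategy here are illustrative.

    # 1. the build has been running for more than `delta`
    if timezone.now() - build.date < delta:
        return False

    # 2. the build is not being executed by any Celery worker
    active = app.control.inspect().active() or {}
    for _worker, tasks in active.items():
        for task in tasks:
            if str(build.pk) in str(task.get('args')) + str(task.get('kwargs')):
                return False

    # 3. the build's task message is not queued in Redis
    for queue in ('build', 'build:large'):  # illustrative queue names
        for message in redis_client.lrange(queue, 0, -1):
            if str(build.pk) in message.decode():
                return False

    # Not running and not queued: nothing will ever update this build.
    return True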


Review thread on the following lines of the diff:

These inactive builds will be marked as ``success`` and ``FINISHED`` with an
- it's not in ``FINISHED`` state
- it was created +15 minutes ago

@humitos (Member Author) commented:

I think we can reduce this time. If the task is neither queued nor being executed, the build is already inactive and its state is never going to change on its own.

@ericholscher requested a review from a team on September 21, 2021
@ericholscher (Member) left a comment:

This feels super error-prone and implementation-specific, especially for something that doesn't have any tests. I'm 99% sure this will break within a year because of a change in Celery or Redis, and we will likely never know. I don't know that it's really a useful solution to this problem, but I guess we can try it -- though I'm not sure we'll even be able to tell whether it's working properly, to be honest.

Review thread on the following lines of the diff:

build_stale = True

# 1. check if it's being executed by celery
for queue, tasks in app.control.inspect().active():

@ericholscher (Member) commented Oct 4, 2021:

Does this reliably return all servers in production? I feel like in the past this has often missed many of the running builders.

@humitos (Member Author) commented Oct 5, 2021:

I remember testing this and it was accurate. I don't know how trustworthy it is long term, though.

Another quick test today in production showed accurate numbers at least:

# number of builds marked as running in the db / number of builds executed by workers
In [49]: (
    Build.objects.exclude(state__in=('finished', 'triggered')).count(),
    sum([
        len(tasks)
        for queue, tasks in app.control.inspect().active().items()
        if queue.startswith('build')
        and all([True for task in tasks if task['name'] == 'readthedocs.projects.tasks.update_docs_task'])
    ]),
)
Out[49]: (16, 16)

I don't think this will be 100% accurate at all times, though, mainly because the DB is updated immediately while the worker takes a little time to start the queued task, so there may be a small window of difference. However, this should only happen at the edges (when a task is about to start or about to end), and both cases are covered: the "about to start" case because we require a minimum running time, and the "about to end" case because the task will still be present in Celery but not in the DB, which won't make the algorithm mark it as stale.
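
For reference, active() returns a mapping of each responding worker's hostname to the tasks it is currently executing; workers that don't reply within the timeout are simply missing from the result, which is exactly the reliability concern raised above. A quick per-worker breakdown looks like this (sketch only, with an explicit timeout):

from celery import current_app as app

# Sketch: workers that don't answer the broadcast within `timeout` seconds
# won't appear in the result at all, so a short timeout can under-count.
active = app.control.inspect(timeout=5).active() or {}
for hostname, tasks in active.items():
    build_tasks = [
        task for task in tasks
        if task['name'] == 'readthedocs.projects.tasks.update_docs_task'
    ]
    print(hostname, len(build_tasks))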

@ericholscher (Member) commented:

Cool -- I'm guessing it's less of an issue with the scaling groups, since we don't have super long-lived celery instances, which is likely what was causing the issue.

@humitos (Member Author) commented Oct 5, 2021

@ericholscher I'm fine adding some tests that check the Redis structure we expect, so they fail if Celery changes it for some reason. I'm also fine doing more testing in production before deploying it.
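
Something along these lines, for example (just a sketch; it assumes Celery's current Redis transport, where each queued message ends up as a JSON blob in a Redis list named after the queue, and the queue name and app import path here are assumptions):

import json

from django.test import TestCase

from readthedocs.worker import app  # assumption: this is where the Celery app lives


class CeleryRedisMessageStructureTest(TestCase):
    def test_queued_message_structure(self):
        # Queue a task without a worker running, so the message stays in Redis.
        app.send_task(
            'readthedocs.projects.tasks.update_docs_task',
            kwargs={'build_pk': 1234},
            queue='build',  # illustrative queue name
        )

        # Grab the raw messages straight from the broker's Redis list.
        redis_client = app.connection_for_write().channel().client
        raw_messages = redis_client.lrange('build', 0, -1)
        self.assertTrue(raw_messages)

        # These are the fields the stale-build check would rely on; if Celery
        # changes this layout, the test should fail loudly.
        message = json.loads(raw_messages[0])
        self.assertIn('headers', message)
        self.assertEqual(
            message['headers']['task'],
            'readthedocs.projects.tasks.update_docs_task',
        )
        self.assertIn('id', message['headers'])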

This PR is quite old; I remember testing it locally only and opening it here to start the discussion.

If we don't feel comfortable yet, we can QA it more (locally, on production, and with test cases). However, I'd like to find a better solution than the current one that blocks users' builds for hours.

@ericholscher (Member) commented:

@humitos Yea, our current approach is definitely not great. If we can find a way to actually detect this failing automatically, I'd be more comfortable with merging it. I don't think tests are as important here, since most of the issues we'll hit will be differences in the production environment (e.g. Redis gets upgraded or something, and we don't change the test suite).

@humitos (Member Author) commented Oct 6, 2021

@ericholscher other than the concerns about its future stability, do you think the logic/algorithm is accurate and will act on the builds we expect?

@humitos (Member Author) commented Oct 6, 2021

Hrm, actually, I just found that we already had a test for this, but it was marked as xfail. I tried to adapt it to the new method, but it requires a running Celery instance because of the call to app.control.inspect().active(), and we don't have one running during the tests. We could mock that call, but then we lose the possible structure differences coming from Celery that we are interested in checking.
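
For completeness, the mock itself would be trivial, which is exactly the problem: we'd be asserting against a structure we invented ourselves (sketch only; the worker name and task payload below are made up):

from unittest import mock

from celery import current_app as app

# Sketch: this dict is our own guess at what Celery returns, so the test
# cannot catch upstream changes to the real structure.
fake_active = {
    'celery@build-default-01': [
        {
            'id': 'aaaa-bbbb',
            'name': 'readthedocs.projects.tasks.update_docs_task',
            'args': [],
            'kwargs': {'build_pk': 1234},
        },
    ],
}

with mock.patch('celery.app.control.Inspect.active', return_value=fake_active):
    assert app.control.inspect().active() == fake_active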

@humitos (Member Author) commented Oct 6, 2021

@ericholscher I did extensive QA of the algorithm locally, adapted the code to handle edge cases, and added more logging to it.

Since this runs only on the web queue, we can "live hotfix" it on one web-celery instance before fully deploying it and watch the logs closely to understand how it behaves, making sure it's not finishing builds that are still valid.

@humitos added the Status: blocked label on Oct 20, 2021
@humitos (Member Author) commented Oct 20, 2021

Marked as blocked for now so we don't merge before QAing on production with a closer look.

@humitos force-pushed the humitos/finish-stale-builds branch from f86f13a to fefbe62 on March 3, 2022
@humitos removed the Status: blocked label on Mar 3, 2022
@humitos force-pushed the humitos/finish-stale-builds branch from 1fb6748 to 97aee88 on March 3, 2022
@agjohnson (Contributor) commented Mar 8, 2022

There were concerns about this before the work stalled out, and it wasn't on our sprint, so I am going to remove it from the current release project. I'd like more eyes on this before merging. I'm concerned that we're reaching deep into Celery API interactions, an area that has caused trouble for us in the past.

I'll put it on next release for now

@ericholscher (Member) commented Mar 14, 2022

@humitos I'm still not convinced this complexity is needed. Is this an issue we are hitting a lot, and does this change bring benefits to the user?

I think I'm OK with this approach in theory, but I'm not sure how often this is actually an issue for us. Presumably it only happens when a queue backup of 2+ hours happens? In general we should just try not to have situations where builds go missing, instead of a lot of complex logic to fix state in that case.

@humitos (Member Author) commented Mar 15, 2022

@ericholscher this is not strictly required. However, it solves some issues we do have and it improves UX as well.

The issue it solves is that when a build is not finished for any reason (e.g. the worker got disconnected, the worker was killed by the OS, the AWS worker instance was terminated, etc.), that build currently stays in a non-finished state for more than 2 hours, consuming one of the concurrent builds allowed for that user during all that time, until it's finished by this task. It has happened that 2 builds for the same project were in this non-finished state at the same time and the user was blocked for over 2 hours without being able to trigger a new build.
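
In other words, the concurrency check is essentially counting builds that haven't reached a final state, so a dead build holds one of the user's slots until the stale task cleans it up. Roughly like this (illustrative sketch, not our exact implementation; the import path, state names, and limit are assumptions):

from readthedocs.builds.models import Build  # assumed import path for the Build model


def has_free_build_slot(project, max_concurrent=4):
    # Illustrative only: the real limit and state handling live elsewhere.
    running = Build.objects.filter(project=project).exclude(
        state__in=('finished', 'triggered'),
    ).count()
    # A build whose worker died still sits in 'cloning'/'building'/etc.,
    # so it keeps occupying one of these slots for ~2 hours.
    return running < max_concurrent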

Note that "the AWS working instance is killed" is not happening too often right now because we are not using spot instances currently. In the spot instances scenario, I expect this to happen more often and it's likely that more builds will be into that state. Now, it mostly happens on scale in actions and on deploys.

Ideally, we should listen for the AWS signal that the instance is about to be killed (https://github.com/readthedocs/readthedocs-ops/issues/963) and do something smarter. However, we will always need a fallback mechanism to finish a stale build, as we currently have.

I'm fine not merging this PR if we don't feel comfortable with it, but we need to be aware of the issue and figure out how to solve it in a smarter way: improve the UX in the cases where the user is blocked and avoid the confusion of waiting for a build that is never going to finish.

@humitos (Member Author) commented Mar 15, 2022

I opened PR #9015 to track how often the worker is killed and the build is reset/re-tried because of this. That will give us some data about this case and how it impacts on-demand vs. spot instances.

@ericholscher (Member) commented:

@humitos The build concurrency point is a good one. I think I'm OK moving forward with this then, and seeing how it works in real life. Lowering the timeout seems like it has meaningful user benefit.

@humitos (Member Author) commented Apr 5, 2022

Another point we just found where this PR could help mitigate, or at least reduce, an issue: "blocked builds" count as "running builds" for our metrics, and our scaling policy makes decisions based on that.

So, if we have 10 "blocked builds" (counting as 10 "running builds"), we will scale out our build instances. However, these new instances are not needed because those "blocked builds" are not really running builds; they are failed builds, effectively finished, with a wrong Build.state that was not updated for some reason.
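
For example, if the "running builds" metric is derived from something like this (illustrative only; the metric helper is hypothetical), every blocked build inflates it by one and can trigger an unneeded scale-out:

from readthedocs.builds.models import Build  # assumed import path

# Illustrative sketch of the metric, not our actual scaling code.
running_builds = Build.objects.exclude(
    state__in=('finished', 'triggered'),
).count()

# Blocked builds never leave 'cloning'/'building'/etc., so they are counted
# here as if they were real work until the stale-build task finishes them.
publish_metric('RunningBuilds', running_builds)  # hypothetical helper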

@humitos (Member Author) commented Feb 3, 2023

We are having some issues in commercial with some projects that need more than 6 hours to build. This PR could help those projects avoid having their builds cancelled while they are still running. cc @agjohnson

@agjohnson (Contributor) commented:

Aye, perhaps this might help their case. It's not clear yet whether they will continue running into problems; the one project hitting this limit is also talking about fixing their project.

If we aren't sure about this yet, we could perhaps also feature flag this behavior (if it isn't already).

@humitos (Member Author) commented Aug 9, 2023

I'm closing this PR because it's old and it doesn't seem like it's going to happen soon. We can come back to it if we reconsider.

@humitos closed this on Aug 9, 2023