Builds: check celery and redis before marking a build as stale #8269

Closed
wants to merge 4 commits from the humitos/finish-stale-builds branch

Conversation

@humitos (Member) commented Jun 16, 2021

Previously, we were only checking whether the build had been running for a very
long time (2 hours) before marking it as stale.

This new approach reduces that time to 15 minutes and adds extra checks on top of
that:

  • the build has been running for more than 15 minutes
  • the build is not being executed by Celery (celery.inspect.active)
  • the build is not queued in Redis queues

When all of these are True we mark the build as stale.

Closes #4386
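
Roughly, the check boils down to something like this (a simplified sketch of the idea only, not the code in the diff; the function name, the queue names, and the pk-matching strategy are illustrative):

from datetime import timedelta

from celery import current_app as app
from django.utils import timezone


def build_looks_stale(build, redis_client, delta=timedelta(minutes=15)):
    # Sketch only: names and matching strategy here are illustrative.

    # 1. the build has been running for more than `delta`
    if timezone.now() - build.date < delta:
        return False

    # 2. the build is not being executed by any Celery worker
    active = app.control.inspect().active() or {}
    for _worker, tasks in active.items():
        for task in tasks:
            if str(build.pk) in str(task.get('args')) + str(task.get('kwargs')):
                return False

    # 3. the build's task message is not queued in Redis
    for queue in ('build', 'build:large'):  # illustrative queue names
        for message in redis_client.lrange(queue, 0, -1):
            if str(build.pk) in message.decode():
                return False

    # Not running and not queued: nothing will ever update this build.
    return True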


Review thread on the following lines of the diff:

These inactive builds will be marked as ``success`` and ``FINISHED`` with an
- it's not in ``FINISHED`` state
- it was created +15 minutes ago

@humitos (Member Author) commented:

I think we can reduce this time. If the task is neither queued nor being executed, the build is already inactive and its state is never going to change on its own.

@ericholscher requested a review from a team on September 21, 2021
@ericholscher (Member) left a comment:

This feels super error-prone and implementation-specific, especially for something that doesn't have any tests. I'm 99% sure this will break within a year because of a change in Celery or Redis, and we will likely never know. I don't know that it's really a useful solution to this problem, but I guess we can try it -- though I'm not sure we'll even be able to tell whether it's working properly, to be honest.

Review thread on the following lines of the diff:

build_stale = True

# 1. check if it's being executed by celery
for queue, tasks in app.control.inspect().active():

@ericholscher (Member) commented Oct 4, 2021:

Does this reliably return all servers in production? I feel like in the past this has often missed many of the running builders.

@humitos (Member Author) commented Oct 5, 2021:

I remember testing this and it was accurate. I don't know how trustworthy it is long term, though.

Another quick test today in production showed accurate numbers at least:

# number of builds marked as running in the db / number of builds executed by workers
In [49]: (
    Build.objects.exclude(state__in=('finished', 'triggered')).count(),
    sum([
        len(tasks)
        for queue, tasks in app.control.inspect().active().items()
        if queue.startswith('build')
        and all([True for task in tasks if task['name'] == 'readthedocs.projects.tasks.update_docs_task'])
    ]),
)
Out[49]: (16, 16)

I don't think this will be 100% accurate at all times, though, mainly because the DB is updated immediately while the worker takes a little time to start the queued task, so there may be a small window of difference. However, this should only happen at the edges (when a task is about to start or about to end), and both cases are covered: the "about to start" case because we require a minimum running time, and the "about to end" case because the task will still be present in Celery but not in the DB, which won't make the algorithm mark it as stale.
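
For reference, active() returns a mapping of each responding worker's hostname to the tasks it is currently executing; workers that don't reply within the timeout are simply missing from the result, which is exactly the reliability concern raised above. A quick per-worker breakdown looks like this (sketch only, with an explicit timeout):

from celery import current_app as app

# Sketch: workers that don't answer the broadcast within `timeout` seconds
# won't appear in the result at all, so a short timeout can under-count.
active = app.control.inspect(timeout=5).active() or {}
for hostname, tasks in active.items():
    build_tasks = [
        task for task in tasks
        if task['name'] == 'readthedocs.projects.tasks.update_docs_task'
    ]
    print(hostname, len(build_tasks))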

@ericholscher (Member) commented:

Cool -- I'm guessing it's less of an issue with the scaling groups, since we don't have super long-lived celery instances, which is likely what was causing the issue.

@humitos (Member Author) commented Oct 5, 2021

@ericholscher I'm fine adding some tests that check the Redis structure we expect, so they fail if Celery changes it for some reason. I'm also fine doing more testing in production before deploying it.
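
Something along these lines, for example (just a sketch; it assumes Celery's current Redis transport, where each queued message ends up as a JSON blob in a Redis list named after the queue, and the queue name and app import path here are assumptions):

import json

from django.test import TestCase

from readthedocs.worker import app  # assumption: this is where the Celery app lives


class CeleryRedisMessageStructureTest(TestCase):
    def test_queued_message_structure(self):
        # Queue a task without a worker running, so the message stays in Redis.
        app.send_task(
            'readthedocs.projects.tasks.update_docs_task',
            kwargs={'build_pk': 1234},
            queue='build',  # illustrative queue name
        )

        # Grab the raw messages straight from the broker's Redis list.
        redis_client = app.connection_for_write().channel().client
        raw_messages = redis_client.lrange('build', 0, -1)
        self.assertTrue(raw_messages)

        # These are the fields the stale-build check would rely on; if Celery
        # changes this layout, the test should fail loudly.
        message = json.loads(raw_messages[0])
        self.assertIn('headers', message)
        self.assertEqual(
            message['headers']['task'],
            'readthedocs.projects.tasks.update_docs_task',
        )
        self.assertIn('id', message['headers'])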

This PR is quite old; I remember testing it locally only and opening it here to start the discussion.

If we don't feel comfortable yet, we can QA it more (locally, on production, and with test cases). However, I'd like to find a better solution than the current one that blocks users' builds for hours.

@ericholscher (Member) commented:

@humitos Yea, our current approach is definitely not great. If we can find a way to actually detect this failing automatically, I'd be more comfortable with merging it. I don't think tests are as important here, since most of the issues we'll hit will be differences in the production environment (e.g. Redis gets upgraded or something, and we don't change the test suite).

@humitos (Member Author) commented Oct 6, 2021

@ericholscher other than the concerns about its future stability, do you think the logic/algorithm is accurate and will act on the builds we expect?

@humitos (Member Author) commented Oct 6, 2021

Hrm, actually, I just found that we already had a test for this, but it was marked as xfail. I tried to adapt it to the new method, but it requires a running Celery instance because of the call to app.control.inspect().active(), and we don't have one running during the tests. We could mock that call, but then we lose the possible structure differences coming from Celery that we are interested in checking.
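
For completeness, the mock itself would be trivial, which is exactly the problem: we'd be asserting against a structure we invented ourselves (sketch only; the worker name and task payload below are made up):

from unittest import mock

from celery import current_app as app

# Sketch: this dict is our own guess at what Celery returns, so the test
# cannot catch upstream changes to the real structure.
fake_active = {
    'celery@build-default-01': [
        {
            'id': 'aaaa-bbbb',
            'name': 'readthedocs.projects.tasks.update_docs_task',
            'args': [],
            'kwargs': {'build_pk': 1234},
        },
    ],
}

with mock.patch('celery.app.control.Inspect.active', return_value=fake_active):
    assert app.control.inspect().active() == fake_active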

@humitos (Member Author) commented Oct 6, 2021

@ericholscher I did extensive QA of the algorithm locally, adapted the code to handle edge cases, and added more logging to it.

Since this runs only on the web queue, we can "live hotfix" it on one web-celery instance before fully deploying it and watch the logs closely to understand how it behaves, making sure it's not finishing builds that are still valid.

@humitos added the Status: blocked label on Oct 20, 2021
@humitos (Member Author) commented Oct 20, 2021

Marked as blocked for now so we don't merge before QAing on production with a closer look.

@humitos force-pushed the humitos/finish-stale-builds branch from f86f13a to fefbe62 on March 3, 2022
@humitos removed the Status: blocked label on Mar 3, 2022
@humitos force-pushed the humitos/finish-stale-builds branch from 1fb6748 to 97aee88 on March 3, 2022
@agjohnson (Contributor) commented Mar 8, 2022

There were concerns about this before the work stalled out, and it wasn't on our sprint, so I am going to remove it from the current release project. I'd like more eyes on this before merging. I'm concerned that we're reaching deep into Celery API interactions, an area that has caused trouble for us in the past.

I'll put it on next release for now

@ericholscher (Member) commented Mar 14, 2022

@humitos I'm still not convinced this complexity is needed. Is this an issue we are hitting a lot, and does this change bring benefits to the user?

I think I'm OK with this approach in theory, but I'm not sure how often this is actually an issue for us. Presumably it only happens when a queue backup of 2+ hours happens? In general we should just try not to have situations where builds go missing, instead of a lot of complex logic to fix state in that case.

@humitos (Member Author) commented Mar 15, 2022

@ericholscher this is not strictly required. However, it solves some issues we do have and it improves UX as well.

The issue it solves is that when a build is not finished for any reason (e.g. the worker got disconnected, the worker was killed by the OS, the AWS worker instance was terminated, etc.), that build currently stays in a non-finished state for more than 2 hours, consuming one of the concurrent builds allowed for that user during all that time, until it's finished by this task. It has happened that 2 builds for the same project were in this non-finished state at the same time and the user was blocked for over 2 hours without being able to trigger a new build.
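
In other words, the concurrency check is essentially counting builds that haven't reached a final state, so a dead build holds one of the user's slots until the stale task cleans it up. Roughly like this (illustrative sketch, not our exact implementation; the import path, state names, and limit are assumptions):

from readthedocs.builds.models import Build  # assumed import path for the Build model


def has_free_build_slot(project, max_concurrent=4):
    # Illustrative only: the real limit and state handling live elsewhere.
    running = Build.objects.filter(project=project).exclude(
        state__in=('finished', 'triggered'),
    ).count()
    # A build whose worker died still sits in 'cloning'/'building'/etc.,
    # so it keeps occupying one of these slots for ~2 hours.
    return running < max_concurrent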

Note that "the AWS working instance is killed" is not happening too often right now because we are not using spot instances currently. In the spot instances scenario, I expect this to happen more often and it's likely that more builds will be into that state. Now, it mostly happens on scale in actions and on deploys.

Ideally, we should listen for the AWS signal that the instance is about to be killed (https://github.com/readthedocs/readthedocs-ops/issues/963) and do something smarter. However, we will always need a fallback mechanism to finish a stale build, as we currently have.

I'm fine not merging this PR if we don't feel comfortable with it, but we need to be aware of the issue and figure out how to solve it in a smarter way: improve the UX in the cases where the user is blocked and avoid the confusion of waiting for a build that is never going to finish.

@humitos (Member Author) commented Mar 15, 2022

I opened PR #9015 to track how often the worker is killed and the build is reset/re-tried because of this. That will give us some data about this case and how it impacts on-demand vs. spot instances.

@ericholscher (Member) commented:

@humitos The build concurrency point is a good one. I think I'm OK moving forward with this then, and seeing how it works in real life. Lowering the timeout seems like it has meaningful user benefit.

@humitos (Member Author) commented Apr 5, 2022

Another point we just found where this PR could help mitigate, or at least reduce, an issue: "blocked builds" count as "running builds" for our metrics, and our scaling policy makes decisions based on that.

So, if we have 10 "blocked builds" (counting as 10 "running builds"), we will scale out our build instances. However, these new instances are not needed because those "blocked builds" are not really running builds; they are failed builds, effectively finished, with a wrong Build.state that was not updated for some reason.
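
For example, if the "running builds" metric is derived from something like this (illustrative only; the metric helper is hypothetical), every blocked build inflates it by one and can trigger an unneeded scale-out:

from readthedocs.builds.models import Build  # assumed import path

# Illustrative sketch of the metric, not our actual scaling code.
running_builds = Build.objects.exclude(
    state__in=('finished', 'triggered'),
).count()

# Blocked builds never leave 'cloning'/'building'/etc., so they are counted
# here as if they were real work until the stale-build task finishes them.
publish_metric('RunningBuilds', running_builds)  # hypothetical helper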

@humitos (Member Author) commented Feb 3, 2023

We are having some issues in commercial with some projects that need more than 6 hours to build. This PR could help those projects avoid having their builds cancelled while they are still running. cc @agjohnson

@agjohnson (Contributor) commented:

Aye, perhaps this might help their case. It's not clear yet whether they will continue running into problems; the one project hitting this limit is also talking about fixing their project.

If we aren't sure about this yet, we could perhaps also feature flag this behavior (if it isn't already).

@humitos (Member Author) commented Aug 9, 2023

I'm closing this PR because it's old and it doesn't seem like it's going to happen soon. We can come back to it if we reconsider.

@humitos closed this on Aug 9, 2023