Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce global, per-tenant flags and interval to evaluation threshold to control rule evaluation concurrency #8146

Merged
merged 17 commits into from
Jul 27, 2024

Conversation

gotjosh
Copy link
Contributor

@gotjosh gotjosh commented May 15, 2024

What this PR does

This change introduces two flags that control if rule evaluation concurrently.

First, you have a flag that controls the total amount of rules that can be running concurrently at any given time per ruler replica with:

-ruler.max-global-rule-evaluation-concurrency

Then, this is paired with -ruler.max-concurrent-rule-evaluations-per-tenant to control the amount of rules a single tenant is allowed to have concurrently. By default, this is 4. However, the behaviour is disabled by default because -ruler.max-global-rule-evaluation-concurrency is set to 0 by default.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

NB: I couldn't think of a way to test this without incurring in a significant effort to set a test it, but I'm happy to spend the time if we think it's worth it.

@gotjosh gotjosh requested review from a team and jdbaldry as code owners May 15, 2024 15:06
@gotjosh gotjosh marked this pull request as draft May 16, 2024 08:47
@gotjosh
Copy link
Contributor Author

gotjosh commented May 16, 2024

Going back to draft, as there are two things I need to do:

  1. Put a global limit around concurrency
  2. Mark the flags as experimental

@pracucci pracucci self-requested a review May 19, 2024 07:55
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh for working on this. This PR does what is it says and I'm not super sure it does what we need. I see two main issues.

First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no per-ruler instance max concurrency. I think this is something we should do to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).

Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which is feasible to run concurrently). Our goal is to make to sure we never miss rule group evaluations. We don't care to run concurrently queries for a rule group that evaluated every 1m and all their queries take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluation. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.

pkg/ruler/compat.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
@dimitarvdimitrov
Copy link
Contributor

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

…urrency

This change introduces two flags that control if rule evaluation concurrently.

First, you have a flag that controls the total amount of rules that can be running concurrently at any given time per ruler replica with:

`-ruler.max-global-rule-evaluation-concurrency`

Then, this is paired with `-ruler.max-concurrent-rule-evaluations-per-tenant` to control the amount of rules a single tenant is allowed to have concurrently. By default, this is `4`. However, the behaviour is disabled by default because `-ruler.max-global-rule-evaluation-concurrency` is set to `0` by default.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh force-pushed the enable-rule-group-concurrency branch from 0d17054 to f4f29d8 Compare July 25, 2024 14:09
@gotjosh gotjosh marked this pull request as ready for review July 25, 2024 14:36
@gotjosh gotjosh requested a review from tacole02 as a code owner July 25, 2024 14:36
@gotjosh
Copy link
Contributor Author

gotjosh commented Jul 25, 2024

@pracucci I have completely re-worked this PR to address all your concerns. You'll notice during the review but now we have:

  1. Global concurrency limits that ensure we don't go over this limit even if a given tenant has concurrency slots available.
  2. Per tenant concurrency limits that ensure no single tenant can occupy more than their allocated allowance of slots.
  3. We also have two checks determining whether a rule is eligible for a concurrency slot. 1. Just like Prometheus, the rule must not depend on or have any dependents and 2. It needs to be at risk of missing its evaluation interval by having its total group time exceed 50% of it's interval.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Josh! Many nits but a couple of important things in the concurrency controller. Thanks!

pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
pkg/util/validation/limits.go Outdated Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Outdated Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Outdated Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Outdated Show resolved Hide resolved
pkg/ruler/rule_concurrency.go Outdated Show resolved Hide resolved
@pracucci
Copy link
Collaborator

There's another thing I forgot to mention. We never purge the tenantConcurrency in case of tenant inactivity. I don't think we have to address it in this PR, but I suggest it to address it in a following PR. I have a relatively simple idea on how to do it, we can talk on Slack about it.

Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
@gotjosh gotjosh force-pushed the enable-rule-group-concurrency branch from 0f30e3e to fe2641b Compare July 26, 2024 12:49
@gotjosh gotjosh force-pushed the enable-rule-group-concurrency branch 3 times, most recently from e60f1d7 to 02c2430 Compare July 26, 2024 13:10
…verride

Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh force-pushed the enable-rule-group-concurrency branch from 02c2430 to de2e543 Compare July 26, 2024 13:27
…rimental

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>

// DynamicSemaphore is a semaphore that can dynamically change its max concurrency.
// It is necessary as the max concurrency is defined by the user limits which can be changed at runtime.
type DynamicSemaphore struct {
Copy link
Member

@jhalterman jhalterman Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be good to move to dskit when possible.

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, thanks! I'm approving so you can merged. I left another comment on the metric: I'm still convinced we're not tracking in the right place, and I explained my rationale.

pkg/ruler/rule_concurrency.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
Signed-off-by: gotjosh <josue.abreu@gmail.com>
- rename the threshold flag to include the suffix percentange
- adjust the variable names for the threhold accordingly
- incorporate taylor's feedback on the changelog

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh changed the title Introduce global and per-tenant flags to control rule evaluation concurrency Introduce global, per-tenant flags and interval to evaluation threshold to control rule evaluation concurrency Jul 27, 2024
@gotjosh gotjosh merged commit bd84528 into main Jul 27, 2024
29 checks passed
@gotjosh gotjosh deleted the enable-rule-group-concurrency branch July 27, 2024 09:09
@gotjosh gotjosh mentioned this pull request Jul 29, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants