Introduce global, per-tenant flags and interval to evaluation threshold to control rule evaluation concurrency #8146

gotjosh · 2024-05-15T15:06:39Z

What this PR does

This change introduces two flags that control whether rules are evaluated concurrently.

First, a flag controls the total number of rules that can run concurrently at any given time per ruler replica:

-ruler.max-global-rule-evaluation-concurrency

This is paired with -ruler.max-concurrent-rule-evaluations-per-tenant, which controls how many rules a single tenant is allowed to run concurrently. By default, this is 4. However, the behaviour is disabled by default because -ruler.max-global-rule-evaluation-concurrency defaults to 0.
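A minimal sketch of how the two flags and their defaults fit together, assuming the flag names quoted in this description (a later commit in this PR renames both). The `Config` type, `ConcurrencyEnabled`, `parseConfig`, and the help strings are hypothetical and only illustrate the "disabled unless the global limit is above 0" behaviour; they are not the code in this PR.

```go
package main

import (
	"flag"
	"fmt"
)

// Config is a hypothetical holder for the two settings; it is not Mimir's actual type.
type Config struct {
	MaxGlobalConcurrency    int64
	MaxPerTenantConcurrency int64
}

// RegisterFlags registers both flags with the defaults stated in the description.
// The help strings are illustrative only.
func (c *Config) RegisterFlags(fs *flag.FlagSet) {
	fs.Int64Var(&c.MaxGlobalConcurrency, "ruler.max-global-rule-evaluation-concurrency", 0,
		"Maximum number of rules evaluated concurrently per ruler replica. 0 disables concurrency.")
	fs.Int64Var(&c.MaxPerTenantConcurrency, "ruler.max-concurrent-rule-evaluations-per-tenant", 4,
		"Maximum number of rules a single tenant can evaluate concurrently.")
}

// ConcurrencyEnabled reports whether concurrent evaluation is switched on.
// Assumption: only the global limit acts as the on/off switch.
func (c *Config) ConcurrencyEnabled() bool {
	return c.MaxGlobalConcurrency > 0
}

// parseConfig parses args into a Config using a dedicated FlagSet.
func parseConfig(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("ruler", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	def, _ := parseConfig(nil)
	fmt.Println(def.ConcurrencyEnabled(), def.MaxPerTenantConcurrency) // false 4

	on, _ := parseConfig([]string{"-ruler.max-global-rule-evaluation-concurrency=10"})
	fmt.Println(on.ConcurrencyEnabled(), on.MaxGlobalConcurrency) // true 10
}
```

With no arguments the per-tenant default of 4 is present but has no effect, because the global limit of 0 keeps the feature off.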

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

NB: I couldn't think of a way to test this without significant effort to set up a test, but I'm happy to spend the time if we think it's worth it.

gotjosh · 2024-05-16T08:49:15Z

Going back to draft, as there are two things I need to do:

Put a global limit around concurrency
Mark the flags as experimental

pracucci

Thanks Josh for working on this. This PR does what it says, but I'm not super sure it does what we need. I see two main issues.

First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no max concurrency per ruler instance. I think we should add one to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).

Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that, given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which it is feasible to run concurrently). Our goal is to make sure we never miss rule group evaluations. We don't care about running queries concurrently for a rule group that is evaluated every 1m and whose queries all take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluations. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.

pkg/ruler/compat.go

pkg/ruler/ruler.go

pkg/util/validation/limits.go

dimitarvdimitrov · 2024-06-21T13:36:35Z

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

gotjosh · 2024-07-25T14:40:37Z

@pracucci I have completely reworked this PR to address all your concerns. As you'll notice during the review, we now have:

Global concurrency limits that ensure we don't go over this limit even if a given tenant has concurrency slots available.
Per-tenant concurrency limits that ensure no single tenant can occupy more than their allocated allowance of slots.
We also have two checks determining whether a rule is eligible for a concurrency slot: 1. Just like Prometheus, the rule must not depend on other rules or have any dependents, and 2. it needs to be at risk of missing its evaluation interval, meaning its group's total evaluation time exceeds 50% of its interval.

tacole02

Thank you!

CHANGELOG.md

docs/sources/mimir/configure/configuration-parameters/index.md

pracucci

Nice work Josh! Many nits but a couple of important things in the concurrency controller. Thanks!

pkg/ruler/ruler.go

pkg/util/validation/limits.go

pkg/ruler/rule_concurrency.go

pracucci · 2024-07-26T06:23:51Z

There's another thing I forgot to mention. We never purge the tenantConcurrency in case of tenant inactivity. I don't think we have to address it in this PR, but I suggest addressing it in a follow-up PR. I have a relatively simple idea of how to do it; we can talk about it on Slack.

pkg/util/validation/limits.go

jhalterman · 2024-07-26T17:09:46Z

pkg/ruler/rule_concurrency.go

+
+// DynamicSemaphore is a semaphore that can dynamically change its max concurrency.
+// It is necessary as the max concurrency is defined by the user limits which can be changed at runtime.
+type DynamicSemaphore struct {


This would be good to move to dskit when possible.
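For readers without the full diff, a semaphore whose maximum can change at runtime can be sketched as follows. Only the idea and the doc comment come from the hunk above; the lowercase type name, the fields, and the `TryAcquire`/`Release` methods are assumptions made for illustration, not necessarily what the PR contains.

```go
package main

import (
	"fmt"
	"sync"
)

// dynamicSemaphore is a sketch of a semaphore that re-reads its max
// concurrency on every acquire, so limit changes take effect immediately.
type dynamicSemaphore struct {
	maxConcurrency func() int64 // assumed to return the current limit

	mtx      sync.Mutex
	acquired int64
}

func newDynamicSemaphore(maxConcurrency func() int64) *dynamicSemaphore {
	return &dynamicSemaphore{maxConcurrency: maxConcurrency}
}

// TryAcquire takes a slot without blocking and reports whether it succeeded.
func (s *dynamicSemaphore) TryAcquire() bool {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.acquired >= s.maxConcurrency() {
		return false
	}
	s.acquired++
	return true
}

// Release returns a slot; releasing with nothing acquired is a no-op.
func (s *dynamicSemaphore) Release() {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.acquired > 0 {
		s.acquired--
	}
}

func main() {
	limit := int64(2)
	sem := newDynamicSemaphore(func() int64 { return limit })

	fmt.Println(sem.TryAcquire(), sem.TryAcquire(), sem.TryAcquire()) // true true false

	limit = 3 // raising the limit frees up a slot right away
	fmt.Println(sem.TryAcquire()) // true

	limit = 1 // lowering the limit does not revoke held slots
	sem.Release()
	fmt.Println(sem.TryAcquire()) // false: 2 slots are still held
}
```

One design consequence of reading the limit on each acquire is that lowering it never interrupts running work; the count drains naturally as slots are released.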

pracucci

Nice job, thanks! I'm approving so you can merge. I left another comment on the metric: I'm still convinced we're not tracking it in the right place, and I explained my rationale.

pkg/ruler/rule_concurrency.go

pkg/ruler/ruler.go

gotjosh requested review from a team and jdbaldry as code owners May 15, 2024 15:06

gotjosh marked this pull request as draft May 16, 2024 08:47

pracucci self-requested a review May 19, 2024 07:55

pracucci reviewed May 19, 2024

pkg/ruler/compat.go (outdated)

pkg/ruler/ruler.go (outdated)

pkg/util/validation/limits.go (outdated)

dimitarvdimitrov added the release/notified-changelog-cut label Jun 21, 2024

gotjosh force-pushed the enable-rule-group-concurrency branch from b656a68 to 424fbd1 Compare July 19, 2024 09:39

gotjosh removed the request for review from jdbaldry July 23, 2024 10:32

gotjosh force-pushed the enable-rule-group-concurrency branch from 112d186 to cbf46fd Compare July 25, 2024 12:36

gotjosh removed the release/notified-changelog-cut label Jul 25, 2024

gotjosh added 6 commits July 25, 2024 14:52

appease the linter

ac127c2

Signed-off-by: gotjosh <josue.abreu@gmail.com>

lint: Use Uber's atomic instead of sync

d4ad00f

Signed-off-by: gotjosh <josue.abreu@gmail.com>

Add license header

da3a559

Signed-off-by: gotjosh <josue.abreu@gmail.com>

Test isRuleIndependent

0ba2e16

Signed-off-by: gotjosh <josue.abreu@gmail.com>

lint: don't pass empty labels

f4f29d8

Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh force-pushed the enable-rule-group-concurrency branch from 0d17054 to f4f29d8 Compare July 25, 2024 14:09

gotjosh marked this pull request as ready for review July 25, 2024 14:36

gotjosh requested a review from tacole02 as a code owner July 25, 2024 14:36

gotjosh added 2 commits July 25, 2024 15:47

update changelog

29b1b5f

Signed-off-by: gotjosh <josue.abreu@gmail.com>

Add metrics to the changelog

ddeb56e

Signed-off-by: gotjosh <josue.abreu@gmail.com>

tacole02 approved these changes Jul 25, 2024

CHANGELOG.md (outdated)

CHANGELOG.md (outdated)

docs/sources/mimir/configure/configuration-parameters/index.md (outdated)

pracucci reviewed Jul 26, 2024

Update CHANGELOG.md

d403bef

Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>

gotjosh force-pushed the enable-rule-group-concurrency branch from 0f30e3e to fe2641b Compare July 26, 2024 12:49

gotjosh force-pushed the enable-rule-group-concurrency branch 3 times, most recently from e60f1d7 to 02c2430 Compare July 26, 2024 13:10

Rename both configuration for the ruler instance and the per tenant o…

de2e543

…verride Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh force-pushed the enable-rule-group-concurrency branch from 02c2430 to de2e543 Compare July 26, 2024 13:27

pracucci reviewed Jul 26, 2024

pkg/util/validation/limits.go (outdated)

gotjosh added 3 commits July 26, 2024 16:12

Update about-versioning.md and make sure the flag is marked as expe…

421acfa

…rimental Signed-off-by: gotjosh <josue.abreu@gmail.com>

address review comments in rule_concurrency.go

004f3a1

Signed-off-by: gotjosh <josue.abreu@gmail.com>

Make threshold for group at risk configurable and test it

f419dc0

Signed-off-by: gotjosh <josue.abreu@gmail.com>

jhalterman reviewed Jul 26, 2024

pracucci approved these changes Jul 27, 2024

pkg/ruler/rule_concurrency.go (outdated)

pkg/ruler/ruler.go (outdated)

gotjosh added 4 commits July 27, 2024 09:04

move total metric increment to after we know it is eligible.

b08738d

Signed-off-by: gotjosh <josue.abreu@gmail.com>

address review comments

c4f16e7

- rename the threshold flag to include the suffix percentage - adjust the variable names for the threshold accordingly - incorporate taylor's feedback on the changelog Signed-off-by: gotjosh <josue.abreu@gmail.com>

update docs

79c5590

Signed-off-by: gotjosh <josue.abreu@gmail.com>

add threshold control to about-versioning.md

43c3ede

Signed-off-by: gotjosh <josue.abreu@gmail.com>

gotjosh changed the title ~~Introduce global and per-tenant flags to control rule evaluation concurrency~~ Introduce global, per-tenant flags and interval to evaluation threshold to control rule evaluation concurrency Jul 27, 2024

gotjosh merged commit bd84528 into main Jul 27, 2024
29 checks passed

gotjosh deleted the enable-rule-group-concurrency branch July 27, 2024 09:09

gotjosh mentioned this pull request Jul 29, 2024

End to end test rule concurrency #8846

Merged

4 tasks
