-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce global, per-tenant flags and interval to evaluation threshold to control rule evaluation concurrency #8146
Conversation
Going back to draft, as there are two things I need to do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Josh for working on this. This PR does what is it says and I'm not super sure it does what we need. I see two main issues.
First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no per-ruler instance max concurrency. I think this is something we should do to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).
Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which is feasible to run concurrently). Our goal is to make to sure we never miss rule group evaluations. We don't care to run concurrently queries for a rule group that evaluated every 1m and all their queries take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluation. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.
The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase |
b656a68
to
424fbd1
Compare
112d186
to
cbf46fd
Compare
…urrency This change introduces two flags that control if rule evaluation concurrently. First, you have a flag that controls the total amount of rules that can be running concurrently at any given time per ruler replica with: `-ruler.max-global-rule-evaluation-concurrency` Then, this is paired with `-ruler.max-concurrent-rule-evaluations-per-tenant` to control the amount of rules a single tenant is allowed to have concurrently. By default, this is `4`. However, the behaviour is disabled by default because `-ruler.max-global-rule-evaluation-concurrency` is set to `0` by default. Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
0d17054
to
f4f29d8
Compare
@pracucci I have completely re-worked this PR to address all your concerns. You'll notice during the review but now we have:
|
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work Josh! Many nits but a couple of important things in the concurrency controller. Thanks!
There's another thing I forgot to mention. We never purge the |
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
0f30e3e
to
fe2641b
Compare
e60f1d7
to
02c2430
Compare
…verride Signed-off-by: gotjosh <josue.abreu@gmail.com>
02c2430
to
de2e543
Compare
…rimental Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
|
||
// DynamicSemaphore is a semaphore that can dynamically change its max concurrency. | ||
// It is necessary as the max concurrency is defined by the user limits which can be changed at runtime. | ||
type DynamicSemaphore struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be good to move to dskit when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job, thanks! I'm approving so you can merged. I left another comment on the metric: I'm still convinced we're not tracking in the right place, and I explained my rationale.
Signed-off-by: gotjosh <josue.abreu@gmail.com>
- rename the threshold flag to include the suffix percentange - adjust the variable names for the threhold accordingly - incorporate taylor's feedback on the changelog Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
What this PR does
This change introduces two flags that control if rule evaluation concurrently.
First, you have a flag that controls the total amount of rules that can be running concurrently at any given time per ruler replica with:
-ruler.max-global-rule-evaluation-concurrency
Then, this is paired with
-ruler.max-concurrent-rule-evaluations-per-tenant
to control the amount of rules a single tenant is allowed to have concurrently. By default, this is4
. However, the behaviour is disabled by default because-ruler.max-global-rule-evaluation-concurrency
is set to0
by default.Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.NB: I couldn't think of a way to test this without incurring in a significant effort to set a test it, but I'm happy to spend the time if we think it's worth it.