Implement series limit using ingester own series #6718
Conversation
Nice job! The overall logic is what I expected, so no surprises. I left a few minor comments. Looking forward to the tests!
```diff
@@ -629,7 +629,7 @@ func (t *Mimir) initIngesterService() (serv services.Service, err error) {
 	t.Cfg.Ingester.InstanceLimitsFn = ingesterInstanceLimits(t.RuntimeConfig)
 	t.tsdbIngesterConfig()

-	t.Ingester, err = ingester.New(t.Cfg.Ingester, t.Overrides, t.ActiveGroupsCleanup, t.Registerer, util_log.Logger)
+	t.Ingester, err = ingester.New(t.Cfg.Ingester, t.Overrides, t.Ring, t.ActiveGroupsCleanup, t.Registerer, util_log.Logger)
```
[non blocking] Unrelated. In a separate PR it would be nice to rename `t.Ring` to `t.IngesterRing`. This naming was done in an era when we only had the ingesters ring, but now we have many rings.
pkg/ingester/user_tsdb.go
Outdated
```go
ownedPrev, shardSizePrev := u.OwnedSeriesAndShards()

var ownedNew int
for {
```
This could potentially lead to an infinite loop, if new series are added continuously. I think it's a bad idea. I would limit the max number of attempts.
I don't think it can lead to an infinite loop, because eventually the user will hit the limit :) But I agree with your proposal to allow a small difference (say, up to 100 series). We can also limit the attempts to whatever value we pick, but also report an error so we retry later. WDYT?
As a best practice we should never have a potentially infinite loop. I don't even want to reason about whether it could be infinite or not because of "other conditions outside of the loop". However, I see you've added a max number of retries, so LGTM.
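For illustration only, the bounded approach discussed above could look roughly like the sketch below; the constants and helper functions are made up for this example and are not the PR's actual code.

```go
// Sketch: recompute owned series, accepting the result only if the series count
// did not change too much in the meantime, and give up after a few attempts.
const (
	maxAttempts   = 3
	maxSeriesDiff = 100
)

// recomputeWithRetries returns the recomputed value and whether it can be trusted.
// computeOwnedSeries and activeSeriesCount are hypothetical helpers.
func recomputeWithRetries(computeOwnedSeries, activeSeriesCount func() int) (int, bool) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		before := activeSeriesCount()
		owned := computeOwnedSeries()
		diff := activeSeriesCount() - before

		// Accept the result if the series count didn't move too much while computing.
		if diff >= 0 && diff <= maxSeriesDiff {
			return owned, true
		}
	}
	// All attempts exhausted: caller should report an error and retry later.
	return 0, false
}
```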
Force-pushed from cd43d14 to f558490.
- Remember a CHANGELOG entry
- Remember to list new experimental config params in docs/sources/mimir/configure/about-versioning.md
```go
}

// Updates token ranges and recomputes owned series for user, if necessary. If recomputation happened, true is returned.
func (oss *ownedSeriesService) updateTenant(userID string, db *userTSDB, ringChanged bool) bool {
```
I find the logic around `reason` a bit cumbersome here. Let's talk offline about it.
I agree it is cumbersome because it covers many different scenarios. I've extended the comment to explain which scenarios we cover, to help understand why it is like this.
Commits:
- …ce as a dependency
- … series. Handle situation when number of series changed while recomputing owned series.
- Update token ranges for new users soon, even if there was no ring change. Make sure to update shard size if it changed.
- …ged, but trigger was set.
- …nts.
- …ries difference.

All commits signed off by Peter Štibraný <pstibrany@gmail.com>.
Force-pushed from fa77048 to 800e1fb.
Very nice job! I haven't found any issues. I left a few last minor comments. Thanks!
```go
ownedSeriesCheckDuration: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
	Name:    "cortex_ingester_owned_series_check_duration",
	Help:    "How long does it take to check for owned series for all users.",
	Buckets: prometheus.DefBuckets,
```
I don't think the default buckets are appropriate here. We expect way lower timings (e.g. a 10s highest bucket looks infinitely high). I would customise the buckets.
I'm worried that on some of our large cells, this can indeed take several seconds. Let's measure and adjust buckets once we see it working.
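If the buckets do get customised later, one possible layout might look like the snippet below; the values are purely illustrative and would need to be confirmed against real measurements, including the multi-second checks expected on large cells.

```go
ownedSeriesCheckDuration: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
	Name:    "cortex_ingester_owned_series_check_duration",
	Help:    "How long does it take to check for owned series for all users.",
	// Illustrative buckets only, replacing prometheus.DefBuckets.
	Buckets: []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60},
}),
```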
pkg/ingester/user_tsdb.go
Outdated
```go
	recomputeOwnedSeriesMaxSeriesDiff = 1000
)

func (u *userTSDB) recomputeOwnedSeriesWithComputeFn(shardSize int, reason string, logger log.Logger, compute func() int) (retry bool, _ int) {
```
[nit] It looks a bit weird to me calling the return param `retry` (even in the callers) when it's effectively a `failed`.
I don't like calling the return value "failed" (I prefer "success"), hence I opted for "retry". I will rename it to `success` instead.
pkg/ingester/user_tsdb.go
Outdated
```go
		u.ownedSeriesMtx.Unlock()
	}

	level.Info(logger).Log("msg", "owned series: recomputed owned series for user",
```
I think the level should change to warning if we failed to update it after all attempts.
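A minimal sketch of that suggestion, assuming a `success` flag and the `userID` are available at the call site (go-kit's `level` package allows choosing the level dynamically):

```go
// Escalate to warning when all recompute attempts failed; stay at info otherwise.
logLevel := level.Info(logger)
if !success {
	logLevel = level.Warn(logger)
}
logLevel.Log("msg", "owned series: recomputed owned series for user", "user", userID)
```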
pkg/ingester/user_tsdb.go
Outdated
```go
		u.ownedSeriesMtx.Unlock()
	}

	level.Info(logger).Log("msg", "owned series: recomputed owned series for user",
```
[non blocking] Do we really have to log it for every tenant? I guess it can help with debugging in case of issues for now, but we may consider getting rid of it once the feature is stable.
Agree to remove it in the future.
pkg/ingester/user_tsdb.go
Outdated
```go
	// Check how many new series were added while we were computing owned series.
	seriesDiff := u.ownedSeriesCount - prevOwnedSeriesCount
	seriesDiffOk := seriesDiff >= 0 && seriesDiff <= recomputeOwnedSeriesMaxSeriesDiff // seriesDiff should always be >= 0, but in case it isn't, we can try again.
	if seriesDiffOk || attempts == recomputeOwnedSeriesMaxAttemps {
```
[non blocking] Generally speaking I'm not a big fan of checking the for-loop stop conditions inside the loop itself. Have you considered simply changing `reportError` into `success`? `success` gets initialised to false, and is switched to true once the value gets updated. The for loop could simply change to `for !success && attempts < recomputeOwnedSeriesMaxAttemps`.
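Roughly, the suggested structure would look like the sketch below, where `tryRecompute` stands in for a single recompute-and-update pass (a hypothetical helper, not the PR's actual code):

```go
success := false
for attempts := 0; !success && attempts < recomputeOwnedSeriesMaxAttemps; attempts++ {
	// tryRecompute performs one recompute pass and reports whether the result was stored.
	success = tryRecompute()
}
```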
Are you suggesting to move the update of `u.ownedSeriesCount` and `u.ownedSeriesShardSize` outside of the `for` loop? That would surely be nicer, but then the locking situation is more complicated. Let me try and see.
After our chat, I've reworked the method. PTAL
What this PR does
This PR implements limiting of user series by using only series owned by the ingester to check the limit.
How it works:
After the ingester opens all local TSDBs, it waits until it is "ACTIVE" in the ring. At that point, the ingester reads tokens from the ring to see which series it actually "owns", and updates the "owned series" count for all open TSDBs. Only after that does the ingester start accepting push requests.
When a new series is created, we assume that it was assigned to this ingester because the ingester "owns" it, and simply increment the "owned series" count for the TSDB.
When a series is deleted, we don't update the "owned series" counter for the TSDB. Deletion only happens during compaction, so we instead recompute "owned series" for the TSDB after compaction has finished.
We also recompute "owned series" periodically for each user's TSDB.
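As a toy model of this counting scheme (illustrative only; the names do not match Mimir's actual types): new series bump the counter immediately, while deletions and ring changes are only reconciled by replacing the counter with a freshly computed value.

```go
import "sync"

// ownedSeriesTracker is a toy model of the per-TSDB "owned series" counter.
type ownedSeriesTracker struct {
	mtx   sync.Mutex
	owned int
}

// onSeriesCreated is called when a new series is created; we assume this
// ingester owns it and increment the counter immediately.
func (t *ownedSeriesTracker) onSeriesCreated() {
	t.mtx.Lock()
	t.owned++
	t.mtx.Unlock()
}

// setRecomputed replaces the counter with a freshly computed value, e.g. after
// compaction (which is when deletions happen) or after a ring change.
func (t *ownedSeriesTracker) setRecomputed(owned int) {
	t.mtx.Lock()
	t.owned = owned
	t.mtx.Unlock()
}
```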
How user limits are affected:
Until this happens, the old shard size is used for the limits computation.
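For context, the per-ingester limit against which owned series are checked is derived from the global limit and the tenant's shard size; the function below is only a hedged sketch of that kind of computation, not Mimir's exact limiter logic.

```go
// Illustrative only: derive a per-ingester series limit from the global limit,
// the tenant's shard size and the replication factor.
func localSeriesLimit(globalLimit, shardSize, replicationFactor int) int {
	if globalLimit <= 0 || shardSize <= 0 {
		return 0 // treated as "no limit" in this sketch
	}
	return globalLimit * replicationFactor / shardSize
}
```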
Credits: the original code was created by @pr00se; I'm continuing his work until he returns.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.