-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
indexheader: keep pre-shutdown snapshot between restarts #8281
indexheader: keep pre-shutdown snapshot between restarts #8281
Conversation
cbe2d95
to
63c1579
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure sure about solving the problem this way. Have you considered guaranteeing that we never write the snapshot file while the store-gateway is starting up?
63c1579
to
486c033
Compare
@pracucci @dimitarvdimitrov I've sketched how this can look like, following the discussion with Marco. As noted, the primary challenge here is that
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted, the primary challenge here is that
BucketStore
doesn't know, in general, if theReaderPool
it created needs sharpshoting (becauseReaderPool
can be not-lazy).
did I read the code right that now we just persist the list of blocks regardless whether they are lazy-loaded or not? I'm fine with this approach
486c033
to
925aa09
Compare
Yes, |
this means that now if you enable lazy loading, the store-gateway will keep the blocks loaded from when lazy loading was disabled. I think it's worth mentioning this in the CLI flag for lazy loading. What do you think? |
I was wrong. We only persist the lazy loaded blocks. So upon a restart we'd just start with no eagerly loaded blocks, which is what happens today too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we're missing the case where a store-gateway discovers a tenant after starting up. Also that leaking goroutine isn't great, I'm not sure how hard it would be to solve it.
pkg/storegateway/bucket.go
Outdated
s.indexHeadersSnapshotter = indexheader.NewSnapshotter(s.logger, snapConfig) | ||
|
||
var lazyLoadedBlocks map[ulid.ULID]int64 | ||
if bucketStoreConfig.IndexHeader.EagerLoadingStartupEnabled { | ||
lazyLoadedBlocks = s.indexHeadersSnapshotter.RestoreLoadedBlocks() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's starting to feel like the ReaderPool
should be a service on its own with its own lifecycle (start
, run
, stop
). This way we're leaking this to other components like the BucketStore.
But just now I saw that BucketStore and BucketStores doesn't manage their own lifecycle either (nothing calls Stop
on them for example), so doing this just for ReaderPool probably won't bring too much value as-is
That said, I think it's nothing actionable unfortuantely
pkg/storegateway/bucket.go
Outdated
if err := s.indexHeadersSnapshotter.Start(ctx, s.indexReaderPool); err != nil { | ||
return errors.Wrap(err, "start index headers snapshotter") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is called in InitialSync
which runs when the store-gateway starts up. What would happen if the tenant was discovered after the store-gateway completed its initial sync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point, I didn't think about that. After judo'ing with the code for a bit, I didn't manage to find a better way to resolve that than with moving the call to snapshot.Start
inside sync. Please, have look through the latest changes.
// Similar to the one above, store-gateway BucketStore starts a goroutine in snapshotter | ||
// that manages lazy-loaded snapshots. It stops the snapshotter on BucketStore close. | ||
goleak.IgnoreTopFunction("github.com/grafana/mimir/pkg/storegateway/indexheader.(*Snapshotter).Start.func1"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The store is closed only when the tenant no longer shards to this store-gateway, which happens fairly rarely. This pushes me more towards having a proper lifecycle. That goroutine doesn't have a good reason to leak. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to have a proper lifecycle in BucketStore
(and maybe BucketStores
) so they managed the lifecycles of ReaderPool
and Snapshotter
? As you mentioned in #8281 (comment) (and what I'd bumped into when looked at it), this might be too large of a change for the expected gain.
I'm not opposed to doing that, though — refactoring can be fun 🕺
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep that was my idea. Make BucketStores
maintain the lifecycle of the BucketStore
instances that it spawns - starting them when they come up, shutting them down when the store-gateway process shuts down. And then doing the appropriate things on these triggers (like starting the snapshotter; maybe even doing the initial sync as part of starting, which would make it slightly cleaner to use the lazy loaded snapshot -- for that we might bring back the snapshotter under ReaderPool). Maybe we can start simple and just managing the snapshotter and readerPool from BucketStore (and transitively from BucketStores)?
But I agree it's some refactoring. I'm happy to merge this PR so we catch the 2.13 and r294 releases on Monday, as long as we figure out how to make it work for tenants which are discovered (the comment above). The rest is just code structure.
c7b388d
to
0dc3231
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for addressing my feedback.
@pracucci do you want to take another look or can we merge this?
BucketStore lifecycle refactor
I'll be somewhat busy with the 2.13 release this week, so I am not sure about when I'll be able to start with the bucket store start/stop lifecycle refactor. In case you can start this earlier I opened this issue #8389 for wheover gets to it first
7d8bb9b
to
69ab5d6
Compare
I'm taking a quick look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Snapshotter
design LGTM, and it should fix the issue we want to fix.
It's important that the snapshot filepath hasn't changed to keep it backward compatible (we want to reload the previous file after rolling out this change), but I think it didn't change, so LGTM.
I left a couple of comments. One is about the design in reader pool, which I think should be re-iterated on too, but it's definitely not a blocker for this PR. The other comment is about what happens if we fail to start the snapshotter. I'm not super convinced about how we handle it.
Not explicitly approving this PR because I haven't done a super accurate review. I defer to Dimitar on this.
logger log.Logger | ||
conf SnapshotterConfig | ||
|
||
stop chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't Snapshotter
a services.Service
? I think most of Snapshotter.Start()
could go away simply using a timer service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the problem is that you need to pass blocksLoader
to Start()
, which is something you can't do with services.Service
. Having to pass blocksLoader
is a design bad smell to me. The bad smell originates from the fact that we have to pass loadedBlocks
to NewReaderPool()
.
Reader pool should be unaware of preShutdownLoadedBlocks
. The loading of previously loaded blocks should triggered outside of the reader pool, and not in the reader pool itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't address this comment in this PR, but I think we should re-iterate on the design to further clean it up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can use #8389 as a venue to address some of this
pkg/storegateway/bucket.go
Outdated
// We do that here so the snapshotter watched after blocks from both initial sync and those discovered later. | ||
var err error | ||
s.snapshotterStartOnce.Do(func() { | ||
err = s.snapshotter.Start(ctx, s.indexReaderPool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Start()
fails, then the snapshotter will never run. Is it what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not trying to start a once failed snapshotter seems like a safe play:
- for an initial sync this will propagate into fail of the service
- for a block discovered at later stages, this will prevent saving the lazy-loaded snapshots for the block (and a log message)
In contrast, I'm not sure that I'd expect that snapshotter.Start
can be called more than once per instance from a code-reader's POV.
I don't feel strong about that, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've chatted about this one with @dimitarvdimitrov, and decided it's not worth it failing the Snapshotter.Start
. The previous version never did it also. I've updated the code.
e1114e3
to
968e334
Compare
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com> Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
5ff5ffb
to
8ce3f22
Compare
What this PR does
The PR updates
indexheader
in a way soReaderPool
preserved the blocks found in the pre-shutdown snapshot.That it, whenReaderPool
persists its own set of blocks tolazy-loaded.json
, it also amends the set with the blocks, it eagerly loaded (or still loading) on startup. The changes make sure we only persist "fresh" blocks, so idle blocks will be evicted from the snapshots, eventually.Which issue(s) this PR fixes or relates to
Fixes #8255
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.