
Store-gateway: Add metrics for fine-grained chunks caching #4213

Closed · wants to merge 2 commits

Conversation

Contributor

@dimitarvdimitrov dimitarvdimitrov commented Feb 9, 2023

Signed-off-by: Dimitar Dimitrov dimitar.dimitrov@grafana.com

Which issue(s) this PR fixes or relates to

Related to #3939; upstreams part of #3968

What this PR does

Proposes changes to the metrics that the store-gateway exposes, to give more granular insight into chunk sizes throughout the loading process. At this point the PR should be a no-op. I'm opening it mainly as a preliminary discussion for the follow-up PR, which will also record these metrics.

This PR adds a stage label to these four metrics:

  • cortex_bucket_store_series_data_fetched
  • cortex_bucket_store_series_data_size_fetched_bytes
  • cortex_bucket_store_series_data_touched
  • cortex_bucket_store_series_data_size_touched_bytes

The stage label is empty for series and postings, so for those data types the metrics behave exactly as before. It is also empty for chunks when the store-gateway is non-streaming, or when fine-grained chunks caching for the streaming store-gateway is disabled (as of this PR, that caching is not yet implemented).

When fine-grained caching is enabled, the stage label will have the following values for data_type="chunks":

  • cortex_bucket_store_series_data_fetched
    • normal - the number of chunks that we had to fetch; this doesn't include refetched chunks, but it does include chunks outside of the request's minT/maxT
    • refetch - the number of chunks from ranges that we had to refetch because they were underfetched
  • cortex_bucket_store_series_data_size_fetched_bytes
    • normal - the raw size of chunk ranges fetched from both the cache and the bucket; this doesn't include overfetched bytes caused by the partitioner
      • I am not sure whether we should include the partitioner effect in this. If we do, we can draw only weaker conclusions about how effective our chunk length estimation is, because the fetched bytes can include almost arbitrary bytes that we have little control over
    • refetch - same as normal, but includes only the raw range sizes that had to be refetched
  • cortex_bucket_store_series_data_touched
    • parsed - the number of chunks that we parsed; this is equal to cortex_bucket_store_series_data_fetched{stage="normal"} and is here for completeness
    • selected - the number of chunks that were selected because they overlap with the request's minT/maxT; the value of this is included in parsed
  • cortex_bucket_store_series_data_size_touched_bytes
    • parsed - the total bytes from the raw ranges that were enough to parse the range; this is the total size of raw ranges minus the bytes we overfetched due to overestimating the chunk size
    • selected - the raw size of chunks that were selected because they overlap with the request's minT/maxT
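As a rough illustration of how the per-request stats might map onto the new (data_type, stage) label pairs, here is a minimal, stdlib-only Go sketch. The queryStats fields and the stageLabels helper are hypothetical simplifications for this discussion, not the store-gateway's actual recording code:

```go
package main

import "fmt"

// queryStats loosely mirrors the store-gateway's per-request stats;
// the field names here are illustrative, not the exact upstream ones.
type queryStats struct {
	chunksFetched   int // fetched chunks, excluding refetches
	chunksRefetched int // chunks from underfetched ranges that were refetched
	chunksParsed    int // parsed chunks (equal to chunksFetched per the PR text)
	chunksTouched   int // selected: chunks overlapping the request's minT/maxT
}

// stageLabels is a hypothetical helper mapping the stats onto the
// (data_type, stage) label pairs described above.
func stageLabels(s queryStats, fineGrainedCaching bool) map[[2]string]int {
	m := map[[2]string]int{}
	if !fineGrainedCaching {
		// Without fine-grained caching the stage label stays empty,
		// so the metrics behave exactly as before.
		m[[2]string{"chunks", ""}] = s.chunksFetched
		return m
	}
	m[[2]string{"chunks", "normal"}] = s.chunksFetched
	m[[2]string{"chunks", "refetch"}] = s.chunksRefetched
	m[[2]string{"chunks", "parsed"}] = s.chunksParsed
	m[[2]string{"chunks", "selected"}] = s.chunksTouched
	return m
}

func main() {
	stats := queryStats{chunksFetched: 10, chunksRefetched: 2, chunksParsed: 10, chunksTouched: 7}
	fmt.Println(stageLabels(stats, true))
}
```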

To make these changes I also had to update the dashboards so that we don't double-count stage="parsed" and stage="selected", since the chunks they record overlap.
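To illustrate the dashboard concern, a hedged PromQL sketch: the metric and label names come from this PR, but the exact dashboard queries (and whether the metric is recorded as a histogram, hence the _sum suffix) may differ.

```promql
# stage="parsed" already includes stage="selected", so a panel should pin
# a single stage rather than summing across all stage values:
sum(rate(cortex_bucket_store_series_data_size_touched_bytes_sum{data_type="chunks", stage="parsed"}[5m]))
```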

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@dimitarvdimitrov
Contributor Author

I'll keep this in draft until I get some feedback. I'll also add a changelog entry after that.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Collaborator

@pracucci pracucci left a comment


I'm 👍 👍 👍 to improving these metrics. I left a few comments, but overall LGTM.

s.metrics.seriesDataSizeTouched.WithLabelValues("chunks").Observe(float64(stats.chunksTouchedSizeSum))
s.metrics.seriesDataSizeFetched.WithLabelValues("chunks").Observe(float64(stats.chunksFetchedSizeSum))
s.metrics.seriesDataTouched.WithLabelValues("postings", "").Observe(float64(stats.postingsTouched))
s.metrics.seriesDataFetched.WithLabelValues("postings", "").Observe(float64(stats.postingsFetched))
Collaborator


In the metric description of the "fetched" metric, can you clarify it's "fetched from the object storage", please? And in the metric of "touched" can you clarify it's "either fetched from the object storage or cache"?

// Currently fineGrainedChunksCachingEnabled is always false. After it is configurable,
// these stats will be recorded by the new caching implementation.
if s.fineGrainedChunksCachingEnabled && s.maxSeriesPerBatch > 0 {
	s.metrics.seriesDataFetched.WithLabelValues("chunks", "normal").Observe(float64(stats.chunksFetched))
Collaborator


Have you considered renaming "normal" to "fetch"? Basically it's a normal "fetch".

Collaborator


[nit] Also, does it make sense to call it "fetched" and "refetched", given we have "parsed" and "selected"?

Comment on lines +44 to +47
chunksParsed int
chunksParsedSizeSum int
chunksRefetched int
chunksRefetchedSizeSum int
Collaborator


It's hard to review this PR because I would like to see how these are tracked.

s.metrics.seriesDataTouched.WithLabelValues("chunks", "parsed").Observe(float64(stats.chunksParsed))
s.metrics.seriesDataSizeTouched.WithLabelValues("chunks", "parsed").Observe(float64(stats.chunksParsedSizeSum))

s.metrics.seriesDataTouched.WithLabelValues("chunks", "selected").Observe(float64(stats.chunksTouched))
Collaborator


Have you considered renaming "selected" to "returned"? To my understanding these are the chunks returned because they are within the query min/max time range.
