Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: label_values: fetch less postings #7814

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

xBazilio
Copy link
Contributor

@xBazilio xBazilio commented Oct 10, 2024

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This change optimizes label_values request to store.

Grafana lets you define template variables. You can define pod variable with query like this: label_values(kube_pod_info{}, pod).

In older versions of grafana this query will lead to series request to store. In newer versions and with particular datasource settings this query will lead to label_values request. End result will be the same, but it considerably differs in data being downloaded from object storage.

Investigations show that in case of label_values thanos store fetches alot of postings. This is due to matcher <labelName> != "" being added in label_values request.

Analyzing debug logs of query stat we can see, that after the change we can download considerably less data from object storage.

label_values_patch

This change checks, if matchers contain __name__ matcher. This matcher, if present, lessens the number of data needed to be downloaded from storage. Because matchers are evalueated separately during blockClient.ExpandPostings() call, matcher <labelName> != "" will select all references to series which have the label. In case of popular labels such as pod in example above this will selecet alot of references.

The change won't affect queries like label_values(pod) or label_values({some_label="some_value"}, pod), where __name__ matcher isn't specified.

Verification

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
@pull-request-size pull-request-size bot added size/M and removed size/S labels Oct 10, 2024
MichaHoffmann
MichaHoffmann previously approved these changes Oct 10, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Good catch.
Do we need to update the comment on L2148?

// Should never be empty since we added labelName!="" matcher to the list of matchers.

This is not necessarily true now if we don't add the non empty matcher if it has metric name?

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
yeya24
yeya24 previously approved these changes Oct 10, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -2033,7 +2042,8 @@ func (s *BucketStore) LabelValues(ctx context.Context, req *storepb.LabelValuesR

// If we have series matchers and the Label is not an external one, add <labelName> != "" matcher
// to only select series that have given label name.
if len(reqSeriesMatchersNoExtLabels) > 0 && !b.extLset.Has(req.Label) {
// We don't need such matcher if matchers already contain __name__ matcher.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand this comment. Why it is specific to the metric name label?
It is just a tradeoff of fetching more postings or series.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my experience, if __name__ is specified, it means, user knows, that the metric contains requested label. The method will select all series with such __name__ and they 99% will have the requested label.
As for random queries, where there can be results with the __name__ but without specified labels, normally, they should be rare. Users still can make queries like label_values({}, pod). They will work, but will fetch alot of data.
So the point is, if the user knows what they need and specifies __name__, we don't need to save them from fetching some extra series. But we do save them from fetching all references to all kube_state_metrics from object storage for example.
Other popular labels may be service, application, job, instance. If __name__ is specified, it is guaranteed it'll be less data, but all the label values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not against this. Just hope we have a better way to cover more labels because we have those information from posting cardinality and series size.

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
CHANGELOG.md Outdated Show resolved Hide resolved
Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants