Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compact: Replace group with resolution in compact metrics. #6049

Merged
merged 1 commit into from
Jun 17, 2023

Conversation

SuperQ
Copy link
Contributor

@SuperQ SuperQ commented Jan 17, 2023

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Compaction metrics have too high a cardinality, causing metric bloat on
large installations. The group information is better suited to logs.

  • Replace with a resolution label to the compaction counters.

Fixes: #5841

Verification

@SuperQ SuperQ force-pushed the superq/remove_group_label branch 2 times, most recently from 65f9dd2 to dbbd535 Compare January 17, 2023 15:16
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 17, 2023
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes make sense to me. Thanks. We just need to fix the test

@SuperQ
Copy link
Contributor Author

SuperQ commented Jan 21, 2023

Yea, I worked on the tests a bit, but ran out of time this week. Will see if I can figure out the rest soon.

@SuperQ SuperQ force-pushed the superq/remove_group_label branch 2 times, most recently from 140e6ab to acf062d Compare February 22, 2023 12:13
@SuperQ SuperQ requested a review from yeya24 February 22, 2023 12:22
@SuperQ SuperQ changed the title compact: Remove group label from compact metrics Compact: Replace group with resolution in compact metrics. Feb 22, 2023
@SuperQ
Copy link
Contributor Author

SuperQ commented Feb 22, 2023

Ok, I updated this to fix the tests and also update the compact dashboards.

@yeya24
Copy link
Contributor

yeya24 commented Feb 23, 2023

Is the CircleCI failure related to this pr? It would be great if you could take a look

@SuperQ
Copy link
Contributor Author

SuperQ commented Feb 23, 2023

Yes, I realized I need to refactor all the tests to use the new resolution label, not group. Will work on it soon.

@SuperQ SuperQ force-pushed the superq/remove_group_label branch 3 times, most recently from a24ce81 to 66a9003 Compare March 1, 2023 09:41
@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 1, 2023

Ok, much closer on the updates to the tests.

@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 1, 2023

@yeya24 Any idea if the store_gateway_test.go could be related to this change? It seems unrelated.

@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 20, 2023

@yeya24 I think the e2e test is flaky, this seems unrelated:

=== NAME  TestRangeQueryDynamicHorizontalSharding
    query_frontend_test.go:673: query_frontend_test.go:673: ""
        
         unexpected error: unable to find metrics [cortex_cache_fetched_keys_total] with expected values after 50 retries. Last error: <nil>. Last values: [2]     

@SuperQ
Copy link
Contributor Author

SuperQ commented Mar 31, 2023

Ping @yeya24, tests are passing now.

@SuperQ SuperQ force-pushed the superq/remove_group_label branch from b08a8bf to 522fe2e Compare May 26, 2023 06:30
yeya24
yeya24 previously approved these changes May 26, 2023
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will let other maintainers take another look before merging as it is a breaking change.

@SuperQ
Copy link
Contributor Author

SuperQ commented May 26, 2023

Yes, this would be great to get in before the next release. I have one cluster with 90,000 metrics on the compact that is nearly useless to query. Plus others with 10k+.

@SuperQ
Copy link
Contributor Author

SuperQ commented May 27, 2023

Maybe @metalmatze or @GiedriusS ?

bwplotka
bwplotka previously approved these changes May 30, 2023
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so generally the groups are useful, as they allow checking each group progress. However indeed for large numbers of streams (e.g. 1k+) this gets troubling. I think it's fair to fix the cardinality explosion problem ASAP and merge this, just I see users with smaller number of groups who liked previous approach. I think we can follow up with opt-in grouping later on if needed 👍🏽

CHANGELOG.md Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member

The opt-in would require a bit more work on instrumentation side, but could be safer and easier for users.

Compaction metrics have too high a cardinality, causing metric bloat on
large installations. The group information is better suited to logs.
* Replace with a `resolution` label to the compaction counters.

Fixes: thanos-io#5841

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ SuperQ dismissed stale reviews from bwplotka and yeya24 via 7eeb1bc May 30, 2023 09:40
@SuperQ SuperQ force-pushed the superq/remove_group_label branch from 522fe2e to 7eeb1bc Compare May 30, 2023 09:40
@SuperQ
Copy link
Contributor Author

SuperQ commented May 30, 2023

IMO, metrics are the wrong place to expose group level progress information. A progress API / dashboard UI would be better.

@SuperQ SuperQ requested review from bwplotka and yeya24 May 30, 2023 11:46
@SuperQ
Copy link
Contributor Author

SuperQ commented Jun 17, 2023

Can this be merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compaction TODO metrics have too high a cardinality
4 participants