
Compact: Offline deduplication #1276

Closed
wants to merge 11 commits

Conversation

smalldirector

Changes

As discussed in #1014, this implements a new feature in the compactor that provides an offline deduplication function for data from different replicas.

Offline deduplication follows the same design as query deduplication. The user needs to specify the replica label via the dedup.replica-label config before enabling it, and the function uses the current query deduplication algorithm (the penalty algorithm) to merge data points coming from different replicas.

The offline deduplication function operates at the bucket level in remote storage, so the user needs to ensure that all replica data is written to the same bucket.
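
For illustration, a minimal sketch of how such a flag could be wired into the compactor command, assuming the kingpin style Thanos uses for flags; only the dedup.replica-label name comes from this PR, the help text and variable names are made up:

package main

import (
	"fmt"
	"os"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("compact-sketch", "illustration of the proposed offline dedup flag")
	// Offline deduplication is enabled only when a replica label is given.
	dedupReplicaLabel := app.Flag("dedup.replica-label",
		"Label to treat as a replica indicator of blocks that can be deduplicated.").
		Default("").String()

	if _, err := app.Parse(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if *dedupReplicaLabel != "" {
		fmt.Println("offline dedup enabled, replica label:", *dedupReplicaLabel)
	}
}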

Verification

  • Compared data quality using the tsdb read API.
  • Compared graphs using the Thanos Query UI.

The figures below compare one sample metric before and after dedup. An internal replica label _agg_replica_ is defined to represent the merged data.

Before Dedup

After Dedup

  • Tested the dedup function online and monitored its metrics (especially memory usage).

In the figure below, each block is around 3GB in size and there are two replicas for each block. Since blocks are processed with streaming reads/writes, no OOM exception occurs.

Metrics

@smalldirector
Author

@bwplotka I submitted the PR for the offline deduplication function; please add reviewers and review the code. Thanks.

@smalldirector changed the title from "Compact: Offline deduplication" to "[WIP]Compact: Offline deduplication" on Jul 1, 2019
@smalldirector changed the title from "[WIP]Compact: Offline deduplication" to "Compact: Offline deduplication" on Jul 15, 2019
@bwplotka
Member

Thanks @smalldirector, we definitely need that (: Is there any way you can split this PR into smaller chunks? It will improve the review process majorly!

@smalldirector
Author

@bwplotka Good question. I tried to split the PR before, but I couldn't figure out a good way. As I understand it, the functionality in the PR needs to keep working before it is merged, and my PR only provides one function: dedup.
Do you have any suggestions on how to break it up?

@smalldirector
Author

@bwplotka The dedup function supports both raw metrics and downsampled metrics; should I split it into two separate PRs (one for raw metrics, another for downsampled metrics)?

Member

@bwplotka left a comment

Awesome! Thanks for this great work, it looks solid. At first glance I think this might work, but I would love to make it a bit less complex, smaller in LOC, and less memory-consuming before merging.

TL;DR: Reuse stuff! (:

You are right that it is hard to split this - but I think we can really decrease the line count majorly by reusing more code.

I proposed various options to reuse code that is already there, so we can have only one way of doing things, e.g.:

  • deduplicating (both offline and online) (iterators)
  • syncing
  • downloading block
  • specifying time ranges
  • specifying aggr type

It will reduce the number of lines in this PR a lot IMO. Also, we probably need to use iterators to stream the whole operation - otherwise it will take too much memory ): I know it is hard work to make this happen, but I'm happy to help with that as well.

We could then split it: making sure Syncer is generic enough to be used by the deduper, and then the rest - we could start with that.

return NewReplicas(s.labelName, result)
}

func (s *ReplicaSyncer) download(ctx context.Context, id ulid.ULID) (*metadata.Meta, error) {
Member

Not a fan of adding yet another Download method. Same with Syncer. Can we reuse those?

We already have a problem with having multiple Syncers, hence #1335.

Author

Sure, I will figure out a proper way to reuse the functions in compact.Syncer.
From #1335, I saw we are going to unify the syncer APIs.
I think if I can do this change after #1335, it will save some effort.

}
}

func (s *ReplicaSyncer) Sync(ctx context.Context) (Replicas, error) {
Member

We already have this for the compactor. Can we reuse it somehow?

Author

Same as above.

postings index.Postings
}

func NewBlockReader(logger log.Logger, resolution int64, blockDir string) (*BlockReader, error) {
Member

Why do we need this? We already have tsdb.BlockReader.

Author

To get the sorted postings for one block, four calls are needed: OpenBlock(), Index(), Postings(), SortedPostings(). This BlockReader just provides a convenient way to access a block's postings and chunk reader.
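
For context, a minimal sketch of that four-call sequence, assuming the standalone github.com/prometheus/tsdb module vendored at the time; the function name and block path are illustrative:

package main

import (
	"fmt"
	"os"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
	"github.com/prometheus/tsdb/index"
)

// openSortedPostings bundles the four calls the wrapper is meant to hide:
// OpenBlock, Index, Postings and SortedPostings, plus the chunk reader that
// is needed later to fetch the actual samples.
func openSortedPostings(logger log.Logger, blockDir string) (index.Postings, tsdb.ChunkReader, error) {
	b, err := tsdb.OpenBlock(logger, blockDir, nil) // 1. open the block (nil pool -> default)
	if err != nil {
		return nil, nil, err
	}
	ir, err := b.Index() // 2. index reader
	if err != nil {
		return nil, nil, err
	}
	cr, err := b.Chunks() // chunk reader for the same block
	if err != nil {
		return nil, nil, err
	}
	p, err := ir.Postings(index.AllPostingsKey()) // 3. all postings
	if err != nil {
		return nil, nil, err
	}
	return ir.SortedPostings(p), cr, nil // 4. sorted so series come in label order
}

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	if _, _, err := openSortedPostings(logger, "/path/to/block"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}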

// samples from non-downsample blocks
RawSample SampleType = iota
// samples from downsample blocks, map to downsample AggrType
CountSample
Member

Can we just reuse AggrType? This will avoid those mappings and simplify code (:

Author

Ok, let me figure out a way to handle raw samples with AggrType.
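
One possible direction, sketched under the assumption that the downsample package's AggrType constants (AggrCount through AggrCounter) are reused directly; the rawSamples sentinel and helper below are hypothetical, not existing Thanos code:

package dedup

import "github.com/thanos-io/thanos/pkg/compact/downsample"

// rawSamples is a hypothetical sentinel appended after the real aggregates so
// raw blocks and downsampled blocks can share one code path keyed on AggrType.
const rawSamples = downsample.AggrCounter + 1

// aggrTypesFor lists which sample kinds a block of the given resolution holds.
func aggrTypesFor(resolution int64) []downsample.AggrType {
	if resolution == 0 {
		// Raw blocks carry plain samples only.
		return []downsample.AggrType{rawSamples}
	}
	// Downsampled blocks carry the five aggregates written by the downsampler.
	return []downsample.AggrType{
		downsample.AggrCount,
		downsample.AggrSum,
		downsample.AggrMin,
		downsample.AggrMax,
		downsample.AggrCounter,
	}
}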

return chks, nil
}

type SampleReader struct {
Member

Hm. We currently deduplicate in the Iterator interface... I wonder if we can reuse the same algorithm as here: https://github.com/thanos-io/thanos/blob/master/pkg/query/iter.go#L375

Again - reusing code reduces the overall complexity of understanding the Thanos codebase (: and lets us gain improvements by changing just a single place (:

Author

Actually, this SampleReader doesn't do the deduplication logic. It just provides a way to read the samples within the specified time range.

Author

And the dedup logic is done using DedupSeriesIterator:

merge := func(st SampleType) []*Sample {

So both online and offline deduplication use the same penalty algorithm.
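
For readers unfamiliar with the penalty algorithm, here is a heavily simplified sketch of the idea; the real DedupSeriesIterator in pkg/query/iter.go works on series iterators and handles more cases, so the slices and the exact penalty update below are only illustrative:

package main

import (
	"fmt"
	"math"
)

type sample struct {
	t int64 // millisecond timestamp
	v float64
}

// dedupMerge merges two time-sorted replica streams. It sticks with the
// current replica and only switches when that replica falls behind the other
// by more than the penalty, which adapts to the observed sample interval.
func dedupMerge(a, b []sample) []sample {
	const initialPenalty = int64(5000)
	penalty := initialPenalty
	lastT := int64(math.MinInt64)
	useA := true
	ia, ib := 0, 0
	var out []sample

	// skip drops samples at or before the last emitted timestamp.
	skip := func(s []sample, i int) int {
		for i < len(s) && s[i].t <= lastT {
			i++
		}
		return i
	}

	for {
		ia, ib = skip(a, ia), skip(b, ib)
		if ia >= len(a) && ib >= len(b) {
			return out
		}
		switch {
		case ib >= len(b):
			useA = true
		case ia >= len(a):
			useA = false
		case useA && a[ia].t > b[ib].t+penalty:
			useA = false // replica A has a gap, switch to B
			if lastT != math.MinInt64 {
				penalty = 2 * (b[ib].t - lastT)
			}
		case !useA && b[ib].t > a[ia].t+penalty:
			useA = true // replica B has a gap, switch back to A
			if lastT != math.MinInt64 {
				penalty = 2 * (a[ia].t - lastT)
			}
		}
		if useA {
			out = append(out, a[ia])
			lastT, ia = a[ia].t, ia+1
		} else {
			out = append(out, b[ib])
			lastT, ib = b[ib].t, ib+1
		}
	}
}

func main() {
	r1 := []sample{{10, 1}, {25, 2}, {40, 3}}
	r2 := []sample{{12, 1}, {27, 2}, {42, 3}}
	fmt.Println(dedupMerge(r1, r2)) // one merged, deduplicated stream
}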

"github.com/thanos-io/thanos/pkg/query"
)

type TimeWindow struct {
Member

We already have tsdb.TimeRange{}

Can we reuse it?

Author

Good to know tsdb has TimeRange. I will change the code to use it.


type BlockGroups []*BlockGroup

func NewBlockGroups(replicas Replicas) BlockGroups {
Member

This might be doable in an Iterator as well. I think it would be more understandable if we had just one method/algorithm for this - for both offline and online deduplication. Online deduplication is available here: https://github.com/thanos-io/thanos/blob/master/pkg/query/iter.go#L304

Author

BlockGroups provides the function to group all the blocks within the specified time range. It doesn't do any deduplication here.
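
To illustrate just the grouping step, here is a small self-contained sketch; the types and field names are stand-ins, not the ones used in this PR:

package main

import "fmt"

// blockMeta is a stand-in for a block's metadata: its time window plus the
// value of the configured replica label.
type blockMeta struct {
	ULID    string
	Replica string
	MinTime int64
	MaxTime int64
}

type window struct{ mint, maxt int64 }

// groupByWindow buckets blocks by their time window so that the blocks from
// all replicas covering the same window end up in one dedup group.
func groupByWindow(metas []blockMeta) map[window][]blockMeta {
	groups := map[window][]blockMeta{}
	for _, m := range metas {
		w := window{m.MinTime, m.MaxTime}
		groups[w] = append(groups[w], m)
	}
	return groups
}

func main() {
	metas := []blockMeta{
		{ULID: "block-a", Replica: "r1", MinTime: 0, MaxTime: 7200000},
		{ULID: "block-b", Replica: "r2", MinTime: 0, MaxTime: 7200000},
		{ULID: "block-c", Replica: "r1", MinTime: 7200000, MaxTime: 14400000},
	}
	for w, g := range groupByWindow(metas) {
		fmt.Printf("window [%d, %d): %d block(s)\n", w.mint, w.maxt, len(g))
	}
}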

pkg/compact/dedup/dedup_test.go (resolved)
@smalldirector
Author

@bwplotka Thanks for your first-round review. I totally agree that we should leverage the existing code more. For the time range and aggr type, I made the code changes. For syncing and block downloading, as I mentioned above, #1335 will refactor those functions, so doing it after #1335 will save us a lot of effort. The offline deduplication logic already follows the same dedup algorithm as online deduplication, so no code changes are needed there.

@bwplotka
Member

bwplotka commented Aug 6, 2019

Cool @smalldirector, so is it ready for the second review iteration? (:

@smalldirector
Author

smalldirector commented Aug 6, 2019

@bwplotka Yes, please do the next round of review. For the meta syncing and block downloading, I would like to do those after #1335 is fixed.

@bwplotka
Member

We need to get back to this finally (: cc @GiedriusS @povilasv @brancz

Sorry for the delay.

@smalldirector force-pushed the offline-dedup branch 2 times, most recently from 395dcce to 0031d8e on October 11, 2019 22:14
@smalldirector
Author

Rebased the code for the upcoming review.

f := func() error {
if isEnableDedup(enableDedup, dedupReplicaLabel) {
Member

Suggested change
if isEnableDedup(enableDedup, dedupReplicaLabel) {
if enableDedup && len(dedupReplicaLabel) > 0 {

@@ -463,3 +479,7 @@ func generateIndexCacheFile(
}
return nil
}

func isEnableDedup(enableDedup bool, dedupReplicaLabel string) bool {
Member

Consider getting rid of this one-line function; IMO there is no point in abstracting a single line of code :)

Author

Ok, I can remove this function.

@@ -665,7 +665,7 @@ type RetryError struct {
err error
}

func retry(err error) error {
func Retry(err error) error {
Member

Does this need to be an exported method? If yes, add a comment, but I would keep it internal if possible.

Author

As this function is used from another package, I need to make it public. I will add a comment for this method.
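
A small sketch of the exported variant with the doc comment Go expects for exported identifiers; the RetryError shape below only mirrors the existing pkg/compact type for illustration:

package compact

// RetryError marks an error as retriable (mirrors the existing type).
type RetryError struct {
	err error
}

func (e RetryError) Error() string { return e.err.Error() }

// Retry wraps err as a RetryError so that callers outside this package,
// such as the dedup package, can mark failures as retriable.
func Retry(err error) error {
	if err == nil {
		return nil
	}
	return RetryError{err: err}
}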

"github.com/thanos-io/thanos/pkg/objstore"
)

type BucketDeduper struct {
Member

We are missing comments on exported structs / methods / functions.

Author

I will review all the code and see whether I can change some exported structs / methods / functions to internal. If not, I will add comments for them. Thanks.

@Reamer

Reamer commented Oct 18, 2019

Hi @smalldirector,

First of all, thank you for developing this feature. I tested it in my test environment and will keep it running. If I encounter issues, I'll let you know.

Used code

I've merged it with Thanos v0.8.1.

Before De-duplication

I used thanos bucket web with --label=prometheus_replica.
I have two Prometheus StatefulSets in an HA setup, created with prometheus-operator. One is called prometheus-k8s, the other prometheus-avm.
Pre_Dedup

During De-duplication

For both Prometheus setups I noticed a lot of warning messages of the kind find empty downsample chunk during de-duplication of resolution 3600000.
Full error message:

resolution=3600000 metric in prometheus-k8s job:
{"caller":"reader.go:324","labels":"{__name__=\"admission_quota_controller_work_duration\",endpoint=\"https\",instance=\"10.20.15.13:443\",job=\"apiserver\",namespace=\"default\",quantile=\"0.99\",service=\"kubernetes\"}","level":"warn","msg":"find empty downsample chunk","ref":887438,"ts":"2019-10-16T07:16:23.650581433Z","type":"counter"}

resolution=3600000 metric in prometheus-avm job:
{"caller":"reader.go:324","labels":"{__name__=\"http_request_duration_microseconds\",endpoint=\"http\",handler=\"flags\",instance=\"10.130.17.27:10902\",job=\"thanos-metrics\",namespace=\"thanos\",pod=\"thanos-querier-657d5f7d9-qvzmq\",quantile=\"0.5\",service=\"thanos-metrics\"}","level":"warn","msg":"find empty downsample chunk","ref":51486867,"ts":"2019-10-17T05:57:53.170766602Z","type":"counter"}

first_dedbug_3600000_with_many_errors

De-duplication of resolution 0 (Raw-files) and resolution 300000 was done without warning or error messages.

After De-duplication

After de-duplication succeeded, the downsampling algorithm started to create new blocks for 300000 (a third block for 300000). I have no idea why that was happening; maybe you or another developer knows more. I interrupted this process and deleted all blocks of resolution 300000 and 3600000. After this I restarted the compactor with downsampling again.

After downsampling overnight everything looks nearly perfect.
After_Dedup
After changing thanos bucket web config to --label=prometheus it looks perfect.
After_Dedup2

Proposal for changes from a user perspective

  • Check the de-duplication algorithm for resolution 3600000.

@ivan-kiselev

This would be just an awesome thing to have. Looking forward to it!

@smalldirector
Author

Just came back from vacation. Thanks for the review @povilasv, I will make the related code changes per the comments. @Reamer Thanks for your testing, I will look into your question.

@smalldirector
Author

@Reamer The logic for emitting the warning log find empty downsample chunk is pretty simple: the function tries to read chunk data for the given downsampling aggr type, and if no such chunk is found, it emits this log.
From your test, there is no counter downsampling data for the given label set. Can you write a simple block reader to test your downsampled block? For example, something like the sketch below.
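
A rough sketch of what such a reader could look like, assuming the downsample package's NewPool/AggrChunk/AggrCounter names and the standalone prometheus/tsdb module used at the time (please double-check these names against the actual code; the block path is a placeholder):

package main

import (
	"fmt"
	"os"

	"github.com/go-kit/kit/log"
	"github.com/prometheus/tsdb"
	"github.com/prometheus/tsdb/chunks"
	"github.com/prometheus/tsdb/index"
	"github.com/prometheus/tsdb/labels"
	"github.com/thanos-io/thanos/pkg/compact/downsample"
)

func main() {
	logger := log.NewLogfmtLogger(os.Stderr)
	// Open the downsampled block with a pool that can decode aggregate chunks.
	b, err := tsdb.OpenBlock(logger, "/path/to/downsampled/block", downsample.NewPool())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ir, _ := b.Index()
	cr, _ := b.Chunks()
	p, _ := ir.Postings(index.AllPostingsKey())

	var lset labels.Labels
	var chks []chunks.Meta
	for p.Next() {
		if err := ir.Series(p.At(), &lset, &chks); err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		for _, c := range chks {
			chk, err := cr.Chunk(c.Ref)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				continue
			}
			// Look up the counter aggregate inside this downsampled chunk.
			counter, err := downsample.AggrChunk(chk.Bytes()).Get(downsample.AggrCounter)
			if err != nil || counter.NumSamples() == 0 {
				fmt.Printf("no counter samples: series=%v chunk ref=%d\n", lset, c.Ref)
			}
		}
	}
}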

shuaizhang and others added 10 commits December 14, 2019 16:58
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: shuaizhang <shuaizhang@ebay.com>
Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
@smalldirector
Author

@Reamer Merged the latest code and fixed the issues.

Member

@bwplotka left a comment

Thanks for maintaining it!

I think it's a lot of code to maintain and support but there is a way to simplify this by reusing more from Prometheus.

Yes, we need this grouping code which you did, which is amazing, but for deduping we might need to just:

  • Abstract chunks.MergeOverlappingChunks(chks) in Prometheus and allow injecting custom func(chks []Meta) ([]Meta, error) mergers (see the sketch after this list). This requires an upstream change which I think the Prometheus maintainers will accept (:
  • Implement a merger based on the deduplication logic we have here.
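
Roughly, the hook could look like this; nothing below exists upstream yet, and the prometheus/prometheus/tsdb/chunks import path plus all names only illustrate the shape of the proposed change:

package dedupmerge

import "github.com/prometheus/prometheus/tsdb/chunks"

// ChunkMerger is the injectable hook: given one series' overlapping chunk
// metas, return the merged chunk metas the compactor should write out.
type ChunkMerger func(chks []chunks.Meta) ([]chunks.Meta, error)

// defaultMerger keeps today's upstream behavior.
func defaultMerger(chks []chunks.Meta) ([]chunks.Meta, error) {
	return chunks.MergeOverlappingChunks(chks)
}

// dedupMerger is where Thanos would plug in the penalty algorithm: decode the
// replicas' chunks, merge the sample streams with the dedup iterator, and
// re-encode the result. Left as a stub here.
func dedupMerger(chks []chunks.Meta) ([]chunks.Meta, error) {
	// ... penalty-based merge of the decoded samples would go here ...
	return chks, nil
}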

What do you think? (: Would love to have this in Thanos finally ❤️

@smalldirector
Author

smalldirector commented Dec 22, 2019

Thanks for the suggestion.
AFAIK, Thanos supports both raw and downsampled data blocks, but there is no downsampling concept in the Prometheus world. We may be able to use the chunks.MergeOverlappingChunks(raw chks) function provided by Prometheus to do the merge logic, however it seems impossible to support MergeOverlappingChunks(downsampling chks) in Prometheus.
I don't think we can use this one function to handle both the raw and downsampled cases.

@bwplotka
Member

chunks.MergeOverlappingChunks(raw chks)

This should be fine as we hide downsampled chunks under the common Prometheus chunk interface (:

@smalldirector
Author

chunks.MergeOverlappingChunks(raw chks)

This should be fine as we hide downsampled chunks under the common Prometheus chunk interface (:

Can anyone help make this change in Prometheus upstream?

@bwplotka
Member

bwplotka commented Jan 9, 2020

Happy New Year!

Can anyone help make this change in Prometheus upstream?

In what sense help?

I can help, but it's as easy as proposing a change that injects an interface/function for chunks.MergeOverlappingChunks(raw chks) into the Compactor (: Happy to look at that tomorrow.

@Reamer

Reamer commented Jan 31, 2020

Hi,
I don't want this cool feature to fall asleep. @bwplotka Can we move forward here?

I get a merge conflict with v0.10.1. @smalldirector Can you please rebase your code onto the current master?

Thanks for your work

Reamer

@bwplotka
Member

bwplotka commented Feb 6, 2020

@Reamer @smalldirector We looked into the details and we believe we can do 90% of this PR by just reusing existing components (with small upstream changes). Cc @metalmatze

We will spend some time on this next week to show what we mean (: No work will be wasted, as the grouping etc. can be reused from the code here!

@vjsamuel

@bwplotka Curious: would you just add to this PR and move the feature forward, or would this PR be merged in and then modified? We would like to see this feature in place, as we see great value in it. @smalldirector thank you for continuing work on this.

@bwplotka
Member

The feature is literally 100 lines of code if we integrate it all with the TSDB code and reuse more, so I would say another PR. Are you ok with this @smalldirector? It looks like Shuai has been inactive recently as well.

Thank you for your hard work on this, I think we learned a lot with this implementation. Will close for now.

cc @metalmatze who is working on this, this week.

@bwplotka closed this on Feb 24, 2020
@d-ulyanov
Contributor

Waiting for this feature. @bwplotka, is there another PR already?

@bwplotka
Member

@metalmatze is working on the first iteration; I am preparing the interface upstream, maybe as part of this: prometheus/prometheus#5882

@vjsamuel

@bwplotka @metalmatze is there an update on this one?

@Reamer

Reamer commented Mar 17, 2020

@bwplotka
Member

bwplotka commented Mar 17, 2020

Yes, we are close to enabling vertical compaction (which is a bit different from offline dedup): #2250

Then offline dedup indeed requires prometheus/prometheus#6990, which is cloooooooose to being done (:
