
Added binary index header implementation. #1952

Merged: 4 commits into master from build-index-header-v2 on Jan 21, 2020

Conversation

@bwplotka (Member) commented Jan 7, 2020

This PR adds the index-header implementation based on this design: https://thanos.io/proposals/201912_thanos_binary_index_header.md/

It adds separate indexheader.Binary* structs and methods that allow building and reading the index-header in binary format.

Fixes: #1711
Fixes: #1873
Fixes: #1750
Fixes: #1655
Fixes: #1471
Fixes: #1455
Fixes: #942
Fixes: #448

cc @brian-brazil @codesome as you are super familiar with the TSDB format.
cc @pracucci

Stats

Size difference:

For a 10k-series autogenerated block:

-rw-r--r-- 1 bwplotka bwplotka 6.1M Jan 10 13:20 index
-rw-r--r-- 1 bwplotka bwplotka  23K Jan 10 13:20 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 9.2K Jan 10 13:20 index-header

For a production block with 8M series, the gain is similar:

-rw-r--r-- 1 bwplotka bwplotka 1.9G Jan 10 13:29 index
-rw-r--r-- 1 bwplotka bwplotka 287M Jan 10 13:29 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 122M Jan 10 13:29 index-header

NOTE: The size is smaller, but size is not what we are trying to optimize for. Nevertheless,
PostingOffsets and Symbols take a significant amount of bytes. The only downside of the size
is that to create such an index-header we have to fetch those two parts (~60MB each) from object storage.
An idea for a further improvement, if this ever becomes a problem: cache only every 32nd posting range and
fetch the gaps between them on demand at query time (with some cache), as sketched below.
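
A minimal Go sketch of that idea (hypothetical names, not part of this PR): keep only every 32nd posting offset in memory and derive the byte range to fetch for values that fall in the gaps:

package indexheader // hypothetical placement

import "sort"

// postingOffset is one in-memory sample of the postings offset table.
type postingOffset struct {
	value    string
	tableOff int64
}

// rangeFor returns the byte range [start, end) of the postings offset table
// that must be fetched from object storage to locate `value` exactly, given
// that `sampled` keeps only every 32nd entry, sorted by value.
func rangeFor(sampled []postingOffset, value string, tableEnd int64) (start, end int64) {
	// Find the last sampled entry whose value is <= the requested value.
	i := sort.Search(len(sampled), func(i int) bool { return sampled[i].value > value }) - 1
	if i < 0 {
		i = 0 // value sorts before the first sample; fetch from the start.
	}
	start = sampled[i].tableOff
	end = tableEnd
	if i+1 < len(sampled) {
		// The next sample bounds the gap we need to fetch.
		end = sampled[i+1].tableOff
	}
	return start, end
}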

Wall-clock latencies for creation and loading (without network traffic):

For the 10k block they are similar for both (in the ms/µs range); for the 8M-series block we can spot the difference:

index-header:

  • write 134.197732ms
  • read 415.971774ms

index-cache.json:

  • write 6.712496338s
  • read 6.112222132s

Go Benchmarks:

Before comparing I renamed the benchmarks so benchcmp can correlate the tests:

BenchmarkJSONReader-12   -> BenchmarkRead-12  (old)
BenchmarkBinaryReader-12 -> BenchmarkRead-12  (new)
BenchmarkJSONWrite-12    -> BenchmarkWrite-12 (old)
BenchmarkBinaryWrite-12  -> BenchmarkWrite-12 (new)
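
(With the names aligned, the comparison itself is then just a matter of running the benchmarks on each branch, e.g. go test -bench=. -benchmem > old.txt / new.txt, and running benchcmp old.txt new.txt. The exact flags here are illustrative.)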

10k series block:

benchmark             old ns/op     new ns/op     delta
BenchmarkRead-12      591780        66613         -88.74%
BenchmarkWrite-12     2458454       6532651       +165.72%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      2306           629            -72.72%
BenchmarkWrite-12     1995           64             -96.79%

benchmark             old bytes     new bytes     delta
BenchmarkRead-12      150904        32976         -78.15%
BenchmarkWrite-12     161501        73412         -54.54%

The CPU time regression on write for the smaller index file is interesting. The absolute value is low anyway; might be something to follow up on.

8M-series block (the index takes ~2GB, so it's not committed to git):

benchmark             old ns/op      new ns/op     delta
BenchmarkRead-12      7026290474     552913402     -92.13%
BenchmarkWrite-12     6480769814     276441977     -95.73%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      20100014       5501312        -72.63%
BenchmarkWrite-12     18263356       64             -100.00%

benchmark             old bytes      new bytes     delta
BenchmarkRead-12      1873789526     406021516     -78.33%
BenchmarkWrite-12     2385193317     74187         -100.00%

NOTE: Benchmarks for query time and the CHANGELOG entry will go in the next PR.

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com

@bwplotka bwplotka force-pushed the build-index-header-v2 branch 8 times, most recently from 1855315 to ca2690d, January 10, 2020 16:57
@bwplotka bwplotka marked this pull request as ready for review January 10, 2020 17:12
@bwplotka bwplotka force-pushed the build-index-header-v2 branch 3 times, most recently from 52b01fd to 2163a97, January 10, 2020 17:27
bwplotka added a commit that referenced this pull request Jan 10, 2020
Depends on: #1952

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this pull request Jan 10, 2020
Enabled it also on all our tests.

Depends on: #1952

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@GiedriusS (Member) left a comment:

Thanks for picking up the work on this! Unfortunately I didn't have enough time back then. Obviously I'm not a super expert on the TSDB format, but the code seems clean and the edge cases were thought through AFAICT. It took me some time to wrap my head around all of this. Just some minor comments. The cherry on top would be to add some kind of flag to Thanos Compact that would migrate all of the existing JSON files into binary ones, and to improve the binary reader to try to download them from remote object storage. But I guess that could be added when we move this out of the experimental stage.

LookupSymbol(o uint32) (string, error)
LabelValues(name string) []string

// LabelValues returns all label values for given label name or error if not found.
GiedriusS (Member):

The comment here seems a bit wrong: we return an error when an actual error has occurred, not when we haven't found any label values.
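
A possible rewording of that doc comment (a suggestion, not the merged code):

// LabelValues returns all label values for the given label name. An error is
// returned only when reading the index-header itself fails, not when the
// label name has no values.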

type BinaryTOC struct {
// Symbols holds start to the same symbols section as index related to this index header.
Symbols uint64
// PostingsTable holds start to the same Postings Offset Table section as index related to this index header.
GiedriusS (Member):

The comment should start here with PostingsOffsetTable.
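
For illustration, the corrected comments might read like this (a sketch; field names assumed from the snippet above):

// BinaryTOC is the table of contents of an index-header file.
type BinaryTOC struct {
	// Symbols holds the start of the Symbols section, mirroring the same
	// section of the index this header was built from.
	Symbols uint64
	// PostingsOffsetTable holds the start of the Postings Offset Table
	// section, mirroring the same section of the index.
	PostingsOffsetTable uint64
}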

}
// TODO(bwplotka): This is wrong, probably we have to sort.
if lastKey != nil {
prevRng.End = int64(off + 4)
GiedriusS (Member):

In a lot of places we add 4 to different offsets. I guess that's because of how big crc32 is. I think the code's readability could be improved if we moved that constant 4 into a named constant; crc32.Size could probably be reused.

bwplotka (Member Author):

Done, but it's not the CRC; it's the size of the posting list's length field.
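
For illustration, the named constant could look like this (hypothetical sketch, not necessarily the merged name):

// postingLengthFieldSize is the number of bytes used to encode the length of
// a posting list in the index (a big-endian uint32), which is why 4 is added
// to the offsets above. It happens to equal crc32.Size, but it refers to the
// length field, not the checksum.
const postingLengthFieldSize = 4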

if len(key) != 2 {
return errors.Errorf("unexpected key length for posting table %d", len(key))
}
// TODO(bwplotka): This is wrong, probably we have to sort.
GiedriusS (Member):

So should we sort this or not? 😄

bwplotka (Member Author):

not ;p

func compareIndexToHeader(t *testing.T, indexByteSlice index.ByteSlice, headerReader Reader) {
indexReader, err := index.NewReader(indexByteSlice)
testutil.Ok(t, err)
defer func() { _ = indexReader.Close() }()
GiedriusS (Member):

Shouldn't we check that this succeeds?

bwplotka (Member Author):

No, because it's a Prometheus indexReader and we are not testing that one here.

@pracucci (Contributor) left a comment:

Magnificent job @bwplotka! The change set LGTM. The only reason I haven't approved is that I don't feel comfortable enough to deeply review BinaryReader. I left a few comments here and there, mainly questions.

return nil, errors.Wrap(d.Err(), "read postings list")
}
// 4 for posting length, then n * 4 for each big endian posting.
return b[:4+n*4], nil
pracucci (Contributor):

Shouldn't we double check there are actually 4+n*4 bytes in b?

bwplotka (Member Author):

nice catch!
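
A sketch of such a guard (names taken from the snippet above; the exact form in the PR may differ):

// The buffer must contain the 4-byte length field plus n big-endian
// uint32 postings; otherwise the slice below would panic.
if expected := 4 + n*4; len(b) < expected {
	return nil, errors.Errorf("posting buffer too short: got %d bytes, want %d", len(b), expected)
}
return b[:4+n*4], nil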

c := b[p.ptr.Start-start : p.ptr.End-start]

_, fetchedPostings, err := r.dec.Postings(c)
// index-header can estimate endings, which means we need to resize the endings.
pracucci (Contributor):

Shouldn't it be the index-header reader's responsibility to return the right offsets, instead of having resizePostings() here?

bwplotka (Member Author):

We can't, due to the arbitrary padding that can occur - it's fine though; the overfetch is usually small.

We can only resize exactly once we know the number of entries from the fetched posting, as sketched below.
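
A rough sketch of that trim step (hypothetical helper, mirroring the explanation above):

// trimPostings cuts an over-fetched posting buffer down to its exact size.
// The index-header stores only estimated end offsets (the start of the next
// offset-table entry, which may include padding), so the fetched range can
// be slightly larger than the posting list itself. Once the 4-byte length
// field is decoded and n is known, the exact end is 4 + n*4.
func trimPostings(fetched []byte, n int) []byte {
	exact := 4 + n*4
	if len(fetched) > exact {
		return fetched[:exact] // drop the over-fetched padding bytes
	}
	return fetched
}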

cur uint32
}

// TODO(bwplotka): Expose those inside Prometheus.
pracucci (Contributor):

👍

return br, nil
}

level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err)
pracucci (Contributor):

Is this a warning? It looks more like an Info() to me, given we don't expect to find the index-header on disk the first time a block is seen.

bwplotka (Member Author):

👍 Definitely! I would say even debug.


level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err)

if err := WriteBinary(ctx, bkt, id, binfn); err != nil {
pracucci (Contributor):

It would be nice to track a histogram metric with the time this takes.

bwplotka (Member Author):

Hm... let's start with the elapsed time in the log line, WDYT?

pracucci (Contributor):

Agree

r.postings[key[0]] = []postingOffset{}
if lastKey != nil {
// Always include last value for each label name.
r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], tableOff: lastTableOff})
pracucci (Contributor):

Although the logic is correct, it may be clearer if we also set lastKey = nil after this line. Up to you.
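
The suggested tweak, as a sketch against the snippet above:

if lastKey != nil {
	// Always include the last value for each label name.
	r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], tableOff: lastTableOff})
	lastKey = nil // make it explicit that the previous label name is done
}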

}
if i+1 == len(e) {
for i := range tmpRngs {
tmpRngs[i].End = r.indexLastPostingEnd
pracucci (Contributor):

Does this mean that when we load the postings, if we're reading the first one, we're going to download the entire postings section from the bucket?

bwplotka (Member Author):

Wow ❤️ You just found a major bug here! Nice - I fixed it and made the tests stricter! Awesome!

@bwplotka (Member Author):

Thanks for the reviews! All should be addressed.

@GiedriusS:

The cherry on top of the pie would be to add some kind of flag to Thanos Compact which would migrate all of the existing JSON files into the binary ones and improve the binary reader to try to download them from remote object storage. But I guess that could be added when we will move this from the experimental stage.

I would say: let's remove all index-cache.json files, as we build everything on demand on the Store Gateway (:

@pracucci kudos for finding quite a major bug. I think a Series() micro-benchmark would have found this... if we had any (: TODO for the next PR to iterate on the index-header.

@bwplotka (Member Author):

Should be ready to go, so we can iterate on it later on 👍

It is hidden behind an experimental flag for now.

@squat @brancz

return nil, errors.Wrap(err, "write index header")
}

level.Debug(logger).Log("msg", "built index-header file", "path", binfn, "elapsed", time.Since(start))
pracucci (Contributor):

Given we don't have any metric for when an index-header is created, if there's a bug which re-creates the index-header for a block each time (because the reader detects it as corrupted), it would be very difficult to notice unless you run Thanos with debug logging. Switching this log to Info may help.

bwplotka (Member Author):

Right. In this case we need a metric (:

bwplotka (Member Author):

Will add that in next PR. (:
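
Such a metric could be as simple as this sketch (hypothetical metric name, not the code from this PR):

import "github.com/prometheus/client_golang/prometheus"

// indexHeaderBuilds counts index-header (re)builds, so a loop that keeps
// re-creating headers for supposedly corrupted files would be visible
// without debug logging.
var indexHeaderBuilds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "thanos_store_index_header_builds_total",
	Help: "Total number of index-header files built from object storage.",
})

// ...registered once and incremented right after a successful WriteBinary:
// reg.MustRegister(indexHeaderBuilds)
// indexHeaderBuilds.Inc()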

@bwplotka bwplotka merged commit 171889d into master Jan 21, 2020
@bwplotka bwplotka deleted the build-index-header-v2 branch January 21, 2020 13:33
@farvour commented Feb 28, 2020:

This feature really helped. It's much better than mmapping the really large index JSON files that were generated by our large Prometheus cluster. Combined with the memcached index caching, I can finally lower the memory requirements for the store gateway significantly in our clusters.

Hope to see more improvements.

@bwplotka (Member Author) commented Feb 28, 2020 via email

@der-eismann:

Just to give some feedback: the memory consumption of our store gateway sank from 28.9 GB to 27.2 GB, which IMHO isn't such a huge improvement. But we still have no idea how the index-cache-size and chunk-pool-size options affect this. We don't use memcached yet.
However, the CPU usage is now significantly lower, reduced from 0.5-core spikes to 0.1-core spikes.
