
Added binary index header implementation. #1952

Merged: 4 commits into master from build-index-header-v2 on Jan 21, 2020

Conversation

@bwplotka (Member) commented Jan 7, 2020

This PR adds the index-header implementation based on this design: https://thanos.io/proposals/201912_thanos_binary_index_header.md/

It adds separate indexheader.Binary* structs and methods that allow building and reading the index-header in binary format.

Fixes: #1711
Fixes: #1873
Fixes: #1750
Fixes: #1655
Fixes: #1471
Fixes: #1455
Fixes: #942
Fixes: #448

cc @brian-brazil @codesome as you are super familiar with the TSDB format.
cc @pracucci

Stats

Size difference:

For a 10k-series autogenerated block:

-rw-r--r-- 1 bwplotka bwplotka 6.1M Jan 10 13:20 index
-rw-r--r-- 1 bwplotka bwplotka  23K Jan 10 13:20 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 9.2K Jan 10 13:20 index-header

For a production block with 8M series, the gain is similar:

-rw-r--r-- 1 bwplotka bwplotka 1.9G Jan 10 13:29 index
-rw-r--r-- 1 bwplotka bwplotka 287M Jan 10 13:29 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 122M Jan 10 13:29 index-header

NOTE: The size is smaller, but size is not what we are trying to optimize for. Nevertheless,
PostingOffsets and Symbols take a significant amount of bytes. The only downside of the size
is that to create such an index-header we have to fetch those two parts (~60MB each) from object storage.
An idea for a further improvement, if this ever becomes a problem: cache only every 32nd posting range and
fetch the gaps between them on demand at query time (with some cache), as sketched below.
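
A minimal Go sketch of that idea (hypothetical names, not part of this PR): keep only every 32nd posting offset in memory and derive the byte range to fetch for values that fall in the gaps:

package indexheader // hypothetical placement

import "sort"

// postingOffset is one in-memory sample of the postings offset table.
type postingOffset struct {
	value    string
	tableOff int64
}

// rangeFor returns the byte range [start, end) of the postings offset table
// that must be fetched from object storage to locate `value` exactly, given
// that `sampled` keeps only every 32nd entry, sorted by value.
func rangeFor(sampled []postingOffset, value string, tableEnd int64) (start, end int64) {
	// Find the last sampled entry whose value is <= the requested value.
	i := sort.Search(len(sampled), func(i int) bool { return sampled[i].value > value }) - 1
	if i < 0 {
		i = 0 // value sorts before the first sample; fetch from the start.
	}
	start = sampled[i].tableOff
	end = tableEnd
	if i+1 < len(sampled) {
		// The next sample bounds the gap we need to fetch.
		end = sampled[i+1].tableOff
	}
	return start, end
}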

Wall-clock latencies for creation and loading (without network traffic):

For the 10k block they are similar for both (in the ms/µs range); for the 8M-series block we can spot the difference:

index-header:

  • write 134.197732ms
  • read 415.971774ms

index-cache.json:

  • write 6.712496338s
  • read 6.112222132s

Go Benchmarks:

Before comparing I renamed the benchmarks so benchcmp can correlate the tests:

BenchmarkJSONReader-12   -> BenchmarkRead-12  (old)
BenchmarkBinaryReader-12 -> BenchmarkRead-12  (new)
BenchmarkJSONWrite-12    -> BenchmarkWrite-12 (old)
BenchmarkBinaryWrite-12  -> BenchmarkWrite-12 (new)
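
(With the names aligned, the comparison itself is then just a matter of running the benchmarks on each branch, e.g. go test -bench=. -benchmem > old.txt / new.txt, and running benchcmp old.txt new.txt. The exact flags here are illustrative.)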

10k series block:

benchmark             old ns/op     new ns/op     delta
BenchmarkRead-12      591780        66613         -88.74%
BenchmarkWrite-12     2458454       6532651       +165.72%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      2306           629            -72.72%
BenchmarkWrite-12     1995           64             -96.79%

benchmark             old bytes     new bytes     delta
BenchmarkRead-12      150904        32976         -78.15%
BenchmarkWrite-12     161501        73412         -54.54%

The CPU time regression on write for the smaller index file is interesting. The absolute value is low anyway; might be something to follow up on.

8M-series block (the index takes ~2GB, so it's not committed to git):

benchmark             old ns/op      new ns/op     delta
BenchmarkRead-12      7026290474     552913402     -92.13%
BenchmarkWrite-12     6480769814     276441977     -95.73%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      20100014       5501312        -72.63%
BenchmarkWrite-12     18263356       64             -100.00%

benchmark             old bytes      new bytes     delta
BenchmarkRead-12      1873789526     406021516     -78.33%
BenchmarkWrite-12     2385193317     74187         -100.00%

NOTE: Benchmarks for query time and the CHANGELOG entry will go in the next PR.

Signed-off-by: Bartlomiej Plotka bwplotka@gmail.com

@bwplotka bwplotka force-pushed the build-index-header-v2 branch 8 times, most recently from 1855315 to ca2690d, January 10, 2020 16:57
@bwplotka bwplotka marked this pull request as ready for review January 10, 2020 17:12
@bwplotka bwplotka force-pushed the build-index-header-v2 branch 3 times, most recently from 52b01fd to 2163a97, January 10, 2020 17:27
bwplotka added a commit that referenced this pull request Jan 10, 2020
Depends on: #1952

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this pull request Jan 10, 2020
Enabled it also on all our tests.

Depends on: #1952

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@GiedriusS (Member) left a comment:

Thanks for picking up the work on this! Unfortunately I didn't have enough time back then. Obviously I'm not a super expert on the TSDB format, but the code seems clean and the edge cases were thought through AFAICT. It took me some time to wrap my head around all of this. Just some minor comments. The cherry on top would be to add some kind of flag to Thanos Compact that would migrate all of the existing JSON files into binary ones, and to improve the binary reader to try to download them from remote object storage. But I guess that could be added when we move this out of the experimental stage.

LookupSymbol(o uint32) (string, error)
LabelValues(name string) []string

// LabelValues returns all label values for given label name or error if not found.
GiedriusS (Member):

The comment here seems a bit wrong: we return an error when an actual error has occurred, not when we haven't found any label values.
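
A possible rewording of that doc comment (a suggestion, not the merged code):

// LabelValues returns all label values for the given label name. An error is
// returned only when reading the index-header itself fails, not when the
// label name has no values.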

type BinaryTOC struct {
// Symbols holds start to the same symbols section as index related to this index header.
Symbols uint64
// PostingsTable holds start to the same Postings Offset Table section as index related to this index header.
GiedriusS (Member):

The comment should start here with PostingsOffsetTable.
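
For illustration, the corrected comments might read like this (a sketch; field names assumed from the snippet above):

// BinaryTOC is the table of contents of an index-header file.
type BinaryTOC struct {
	// Symbols holds the start of the Symbols section, mirroring the same
	// section of the index this header was built from.
	Symbols uint64
	// PostingsOffsetTable holds the start of the Postings Offset Table
	// section, mirroring the same section of the index.
	PostingsOffsetTable uint64
}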

}
// TODO(bwplotka): This is wrong, probably we have to sort.
if lastKey != nil {
prevRng.End = int64(off + 4)
GiedriusS (Member):

In a lot of places we add 4 to different offsets. I guess that's because of how big crc32 is. I think the code's readability could be improved if we moved that constant 4 into a named constant; crc32.Size could probably be reused.

bwplotka (Member Author):

Done, but it's not the CRC; it's the size of the posting list's length field.
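
For illustration, the named constant could look like this (hypothetical sketch, not necessarily the merged name):

// postingLengthFieldSize is the number of bytes used to encode the length of
// a posting list in the index (a big-endian uint32), which is why 4 is added
// to the offsets above. It happens to equal crc32.Size, but it refers to the
// length field, not the checksum.
const postingLengthFieldSize = 4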

if len(key) != 2 {
return errors.Errorf("unexpected key length for posting table %d", len(key))
}
// TODO(bwplotka): This is wrong, probably we have to sort.
GiedriusS (Member):

So should we sort this or not? 😄

bwplotka (Member Author):

not ;p

func compareIndexToHeader(t *testing.T, indexByteSlice index.ByteSlice, headerReader Reader) {
indexReader, err := index.NewReader(indexByteSlice)
testutil.Ok(t, err)
defer func() { _ = indexReader.Close() }()
GiedriusS (Member):

Shouldn't we check that this succeeds?

bwplotka (Member Author):

No, because it's a Prometheus indexReader and we are not testing that one here.

@pracucci (Contributor) left a comment:

Magnificent job @bwplotka! The change set LGTM. The only reason I haven't approved is that I don't feel comfortable enough to deeply review BinaryReader. I left a few comments here and there, mainly questions.

return nil, errors.Wrap(d.Err(), "read postings list")
}
// 4 for posting length, then n * 4 for each big endian posting.
return b[:4+n*4], nil
pracucci (Contributor):

Shouldn't we double check there are actually 4+n*4 bytes in b?

bwplotka (Member Author):

nice catch!
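
A sketch of such a guard (names taken from the snippet above; the exact form in the PR may differ):

// The buffer must contain the 4-byte length field plus n big-endian
// uint32 postings; otherwise the slice below would panic.
if expected := 4 + n*4; len(b) < expected {
	return nil, errors.Errorf("posting buffer too short: got %d bytes, want %d", len(b), expected)
}
return b[:4+n*4], nil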

c := b[p.ptr.Start-start : p.ptr.End-start]

_, fetchedPostings, err := r.dec.Postings(c)
// index-header can estimate endings, which means we need to resize the endings.
pracucci (Contributor):

Shouldn't it be the index-header reader's responsibility to return the right offsets, instead of having resizePostings() here?

bwplotka (Member Author):

We can't, due to the arbitrary padding that can occur - it's fine though; the overfetch is usually small.

We can only resize exactly once we know the number of entries from the fetched posting, as sketched below.
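
A rough sketch of that trim step (hypothetical helper, mirroring the explanation above):

// trimPostings cuts an over-fetched posting buffer down to its exact size.
// The index-header stores only estimated end offsets (the start of the next
// offset-table entry, which may include padding), so the fetched range can
// be slightly larger than the posting list itself. Once the 4-byte length
// field is decoded and n is known, the exact end is 4 + n*4.
func trimPostings(fetched []byte, n int) []byte {
	exact := 4 + n*4
	if len(fetched) > exact {
		return fetched[:exact] // drop the over-fetched padding bytes
	}
	return fetched
}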

cur uint32
}

// TODO(bwplotka): Expose those inside Prometheus.
pracucci (Contributor):

👍

return br, nil
}

level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err)
pracucci (Contributor):

Is this a warning? It looks more like an Info() to me, given we don't expect to find the index-header on disk the first time a block is seen.

bwplotka (Member Author):

👍 Definitely! I would say even debug.


level.Warn(logger).Log("msg", "failed to read index-header from disk; recreating", "path", binfn, "err", err)

if err := WriteBinary(ctx, bkt, id, binfn); err != nil {
pracucci (Contributor):

It would be nice to track a histogram metric with the time this takes.

bwplotka (Member Author):

Hm... let's start with the elapsed time in the log line, WDYT?

pracucci (Contributor):

Agree

r.postings[key[0]] = []postingOffset{}
if lastKey != nil {
// Always include last value for each label name.
r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], tableOff: lastTableOff})
pracucci (Contributor):

Although the logic is correct, it may be clearer if we also set lastKey = nil after this line. Up to you.
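
The suggested tweak, as a sketch against the snippet above:

if lastKey != nil {
	// Always include the last value for each label name.
	r.postings[lastKey[0]] = append(r.postings[lastKey[0]], postingOffset{value: lastKey[1], tableOff: lastTableOff})
	lastKey = nil // make it explicit that the previous label name is done
}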

}
if i+1 == len(e) {
for i := range tmpRngs {
tmpRngs[i].End = r.indexLastPostingEnd
pracucci (Contributor):

Does this mean that when we load the postings, if we're reading the first one, we're going to download the entire postings section from the bucket?

bwplotka (Member Author):

Wow ❤️ You just found a major bug here! Nice - I fixed it and made the tests stricter! Awesome!

@bwplotka (Member Author):

Thanks for the reviews! All should be addressed.

@GiedriusS:

The cherry on top of the pie would be to add some kind of flag to Thanos Compact which would migrate all of the existing JSON files into the binary ones and improve the binary reader to try to download them from remote object storage. But I guess that could be added when we will move this from the experimental stage.

I would say: let's remove all index-cache.json files, as we build everything on demand on the Store Gateway (:

@pracucci kudos for finding quite a major bug. I think a Series() micro-benchmark would have found this... if we had any (: TODO for the next PR to iterate on the index-header.

@bwplotka (Member Author):

Should be ready to go, so we can iterate on it later on 👍

It is hidden behind an experimental flag for now.

@squat @brancz

return nil, errors.Wrap(err, "write index header")
}

level.Debug(logger).Log("msg", "built index-header file", "path", binfn, "elapsed", time.Since(start))
pracucci (Contributor):

Given we don't have any metric for when an index-header is created, if there's a bug which re-creates the index-header for a block each time (because the reader detects it as corrupted), it would be very difficult to notice unless you run Thanos with debug logging. Switching this log to Info may help.

bwplotka (Member Author):

Right. In this case we need a metric (:

bwplotka (Member Author):

Will add that in next PR. (:
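
Such a metric could be as simple as this sketch (hypothetical metric name, not the code from this PR):

import "github.com/prometheus/client_golang/prometheus"

// indexHeaderBuilds counts index-header (re)builds, so a loop that keeps
// re-creating headers for supposedly corrupted files would be visible
// without debug logging.
var indexHeaderBuilds = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "thanos_store_index_header_builds_total",
	Help: "Total number of index-header files built from object storage.",
})

// ...registered once and incremented right after a successful WriteBinary:
// reg.MustRegister(indexHeaderBuilds)
// indexHeaderBuilds.Inc()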

@bwplotka bwplotka merged commit 171889d into master Jan 21, 2020
@bwplotka bwplotka deleted the build-index-header-v2 branch January 21, 2020 13:33
@farvour commented Feb 28, 2020:

This feature really helped. It's much better than mmapping the really large index JSON files that were generated by our large Prometheus cluster. Combined with the memcached index caching, I can finally lower the memory requirements for the store gateway significantly in our clusters.

Hope to see more improvements.

@bwplotka (Member Author) commented Feb 28, 2020 via email

@der-eismann:

Just to give some feedback: the memory consumption of our store gateway sank from 28.9 GB to 27.2 GB, which IMHO isn't such a huge improvement. But we still have no idea how the index-cache-size and chunk-pool-size options affect this. We don't use memcached yet.
However, the CPU usage is now significantly lower, reduced from 0.5-core spikes to 0.1-core spikes.
