# Added binary index header implementation with benchmarks
This PR adds an index-header implementation based on [this design](https://thanos.io/proposals/201912_thanos_binary_index_header.md/).

It adds separate `indexheader.Binary*` structs and methods that allow building and reading the index-header in binary format.

## Stats

Size difference:

For a 10k-series block of my autogenerated data it's ~2.1x smaller:

```
-rw-r--r-- 1 bwplotka bwplotka 6.1M Jan 10 13:20 index
-rw-r--r-- 1 bwplotka bwplotka  23K Jan 10 13:20 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 9.2K Jan 10 13:20 index-header
```

For a realistic block with 8 million series the gain is similar:

```
-rw-r--r-- 1 bwplotka bwplotka 1.9G Jan 10 13:29 index
-rw-r--r-- 1 bwplotka bwplotka 287M Jan 10 13:29 index.cache.json
-rw-r--r-- 1 bwplotka bwplotka 122M Jan 10 13:29 index-header
```

NOTE: The size is smaller, but size is not what we are trying to optimize for. Nevertheless,
PostingOffsets and Symbols take a significant amount of bytes. The only downside of the size
is that to create such an index-header we have to fetch those two parts, ~60MB each, from object storage.
Idea for improvement if that becomes a problem: cache only every 32nd posting range and fetch the gaps
between them on demand at query time (with some cache).

Wall-clock latencies for creation and loading (without network traffic):

For the 10k-series block both formats are similar (milliseconds/microseconds); for the 8-million-series block we can spot the difference:

index-header:

* write 134.197732ms
* read 415.971774ms

index-cache.json:

* write 6.712496338s
* read 6.112222132s

## Go Benchmarks:

Before comparing, I renamed the benchmarks so that they correlate:

```
BenchmarkJSONReader-12   -> BenchmarkRead-12 old
BenchmarkBinaryReader-12 -> BenchmarkRead-12 new
BenchmarkJSONWrite-12    -> BenchmarkWrite-12 old
BenchmarkBinaryWrite-12  -> BenchmarkWrite-12 new
```

### 10k series block:

```
benchmark             old ns/op     new ns/op     delta
BenchmarkRead-12      591780        66613         -88.74%
BenchmarkWrite-12     2458454       6532651       +165.72%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      2306           629            -72.72%
BenchmarkWrite-12     1995           64             -96.79%

benchmark             old bytes     new bytes     delta
BenchmarkRead-12      150904        32976         -78.15%
BenchmarkWrite-12     161501        73412         -54.54%
```


The CPU time regression for writing the smaller index file is interesting; the absolute value is low anyway. Might be
something to follow up on.

### 8 million series block (the index takes 2GB, so it is not committed to git):

```
benchmark             old ns/op      new ns/op     delta
BenchmarkRead-12      7026290474     552913402     -92.13%
BenchmarkWrite-12     6480769814     276441977     -95.73%

benchmark             old allocs     new allocs     delta
BenchmarkRead-12      20100014       5501312        -72.63%
BenchmarkWrite-12     18263356       64             -100.00%

benchmark             old bytes      new bytes     delta
BenchmarkRead-12      1873789526     406021516     -78.33%
BenchmarkWrite-12     2385193317     74187         -100.00%
```


Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
bwplotka committed Jan 10, 2020
1 parent 718e51a commit 2163a97
Showing 17 changed files with 1,376 additions and 106 deletions.
59 changes: 59 additions & 0 deletions docs/components/store.md
@@ -221,3 +221,62 @@ While the remaining settings are **optional**:
- `max_get_multi_concurrency`: maximum number of concurrent connections when fetching keys. If set to `0`, the concurrency is unlimited.
- `max_get_multi_batch_size`: maximum number of keys a single underlying operation should fetch. If more keys are specified, internally keys are split into multiple batches and fetched concurrently, honoring `max_get_multi_concurrency`. If set to `0`, the batch size is unlimited.
- `dns_provider_update_interval`: the DNS discovery update interval.


## Index Header

In order to query series inside blocks from object storage, the Store Gateway has to know certain initial info about each block, such as:

* the symbols table, to unintern string values
* the postings offsets, for postings lookup

To achieve this, on startup an `index-header` is built for each block from pieces of the original block's index and stored on disk.
This `index-header` file is then mmapped and used by the Store Gateway.

### Format (version 1)

The following describes the format of the `index-header` file found in the Store Gateway's local directory for each block.
It is terminated by a table of contents which serves as an entry point into the index.

```
┌─────────────────────────────┬───────────────────────────────┐
│ magic(0xBAAAD792) <4b> │ version(1) <1 byte> │
├─────────────────────────────┬───────────────────────────────┤
│ index version(2) <1 byte> │ index PostingOffsetTable <8b> │
├─────────────────────────────┴───────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Symbol Table (exact copy from original index) │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ Posting Offset Table (exact copy from index) │ │
│ ├─────────────────────────────────────────────────────────┤ │
│ │ TOC │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
When the index is written, an arbitrary number of padding bytes may be added between the lined out main sections above. When sequentially scanning through the file, any zero bytes after a section's specified length must be skipped.

Most of the sections described below start with a `len` field. It always specifies the number of bytes just before the trailing CRC32 checksum. The checksum is always calculated over those `len` bytes.
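
Framing and verifying such a `len`-prefixed section might look like the sketch below. It assumes a 4-byte big-endian `len` and the Castagnoli CRC32 table used by the Prometheus TSDB index; the helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32 polynomial table the Prometheus TSDB index uses;
// assumed here to carry over to the index-header.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// writeSection frames data as len<4b> ... data ... CRC32<4b>, where the
// checksum covers exactly the len bytes, as described above.
func writeSection(data []byte) []byte {
	out := make([]byte, 4+len(data)+4)
	binary.BigEndian.PutUint32(out[0:4], uint32(len(data)))
	copy(out[4:], data)
	binary.BigEndian.PutUint32(out[4+len(data):], crc32.Checksum(data, castagnoli))
	return out
}

// readSection validates the trailing checksum and returns the payload.
func readSection(b []byte) ([]byte, error) {
	l := binary.BigEndian.Uint32(b[0:4])
	data := b[4 : 4+l]
	got := binary.BigEndian.Uint32(b[4+l:])
	if want := crc32.Checksum(data, castagnoli); got != want {
		return nil, fmt.Errorf("checksum mismatch: got %x, want %x", got, want)
	}
	return data, nil
}

func main() {
	sec := writeSection([]byte("symbols"))
	payload, err := readSection(sec)
	fmt.Println(string(payload), err)
}
```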
### Symbol Table

See [Symbols](https://github.com/prometheus/prometheus/blob/d782387f814753b0118d402ec8cdbdef01bf9079/tsdb/docs/format/index.md#symbol-table).

### Postings Offset Table

See [Posting Offset Table](https://github.com/prometheus/prometheus/blob/d782387f814753b0118d402ec8cdbdef01bf9079/tsdb/docs/format/index.md#postings-offset-table).

### TOC

The table of contents serves as an entry point to the entire index and points to various sections in the file.
If a reference is zero, it indicates the respective section does not exist and empty results should be returned upon lookup.
```
┌─────────────────────────────────────────┐
│ ref(symbols) <8b> │
├─────────────────────────────────────────┤
│ ref(postings offset table) <8b> │
├─────────────────────────────────────────┤
│ CRC32 <4b> │
└─────────────────────────────────────────┘
```
2 changes: 1 addition & 1 deletion go.mod
@@ -70,7 +70,7 @@ require (
github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4
github.com/prometheus/common v0.7.0
github.com/prometheus/procfs v0.0.6 // indirect
github.com/prometheus/prometheus v1.8.2-0.20200107122003-4708915ac6ef // master ~ v2.15.2
github.com/prometheus/prometheus v1.8.2-0.20200110114423-1e64d757f711 // master ~ v2.15.2
github.com/samuel/go-zookeeper v0.0.0-20190923202752-2cc03de413da // indirect
github.com/satori/go.uuid v1.2.0 // indirect
github.com/smartystreets/assertions v1.0.1 // indirect
4 changes: 2 additions & 2 deletions go.sum
@@ -446,8 +446,8 @@ github.com/prometheus/procfs v0.0.5/go.mod h1:4A/X28fw3Fc593LaREMrKMqOKvUAntwMDa
github.com/prometheus/procfs v0.0.6 h1:0qbH+Yqu/cj1ViVLvEWCP6qMQ4efWUj6bQqOEA0V0U4=
github.com/prometheus/procfs v0.0.6/go.mod h1:7Qr8sr6344vo1JqZ6HhLceV9o3AJ1Ff+GxbHq6oeK9A=
github.com/prometheus/prometheus v0.0.0-20180315085919-58e2a31db8de/go.mod h1:oAIUtOny2rjMX0OWN5vPR5/q/twIROJvdqnQKDdil/s=
github.com/prometheus/prometheus v1.8.2-0.20200107122003-4708915ac6ef h1:pYYKXo/zGx25kyViw+Gdbxd0ItIg+vkVKpwgWUEyIc4=
github.com/prometheus/prometheus v1.8.2-0.20200107122003-4708915ac6ef/go.mod h1:7U90zPoLkWjEIQcy/rweQla82OCTUzxVHE51G3OhJbI=
github.com/prometheus/prometheus v1.8.2-0.20200110114423-1e64d757f711 h1:uEq+8hKI4kfycPLSKNw844YYkdMNpC2eZpov73AvlFk=
github.com/prometheus/prometheus v1.8.2-0.20200110114423-1e64d757f711/go.mod h1:7U90zPoLkWjEIQcy/rweQla82OCTUzxVHE51G3OhJbI=
github.com/rcrowley/go-metrics v0.0.0-20181016184325-3113b8401b8a/go.mod h1:bCqnVzQkZxMG4s8nGwiZ5l3QUCyqpo9Y+/ZMZ9VjZe4=
github.com/rogpeppe/fastuuid v0.0.0-20150106093220-6724a57986af/go.mod h1:XWv6SoW27p1b0cqNHllgS5HIMJraePCO15w5zCzIWYg=
github.com/rogpeppe/fastuuid v1.2.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
4 changes: 3 additions & 1 deletion pkg/block/block.go
@@ -27,8 +27,10 @@ const (
MetaFilename = "meta.json"
// IndexFilename is the known index file for block index.
IndexFilename = "index"
// IndexCacheFilename is the canonical name for index cache file that stores essential information needed.
// IndexCacheFilename is the canonical name for json index cache file that stores essential information.
IndexCacheFilename = "index.cache.json"
// IndexHeaderFilename is the canonical name for binary index header file that stores essential information.
IndexHeaderFilename = "index-header"
// ChunksDirname is the known dir name for chunks with compressed samples.
ChunksDirname = "chunks"

35 changes: 4 additions & 31 deletions pkg/block/block_test.go
@@ -2,7 +2,6 @@ package block

import (
"context"
"io"
"io/ioutil"
"os"
"path"
@@ -12,7 +11,6 @@ import (

"github.com/fortytw2/leaktest"
"github.com/go-kit/kit/log"
"github.com/pkg/errors"
"github.com/prometheus/prometheus/pkg/labels"
"github.com/thanos-io/thanos/pkg/objstore/inmem"
"github.com/thanos-io/thanos/pkg/testutil"
@@ -104,7 +102,7 @@ func TestUpload(t *testing.T) {
testutil.NotOk(t, err)
testutil.Assert(t, strings.HasSuffix(err.Error(), "/meta.json: no such file or directory"), "")
}
testutil.Ok(t, cpy(path.Join(tmpDir, b1.String(), MetaFilename), path.Join(tmpDir, "test", b1.String(), MetaFilename)))
testutil.Copy(t, path.Join(tmpDir, b1.String(), MetaFilename), path.Join(tmpDir, "test", b1.String(), MetaFilename))
{
// Missing chunks.
err := Upload(ctx, log.NewNopLogger(), bkt, path.Join(tmpDir, "test", b1.String()))
@@ -115,7 +113,7 @@ func TestUpload(t *testing.T) {
testutil.Equals(t, 1, len(bkt.Objects()))
}
testutil.Ok(t, os.MkdirAll(path.Join(tmpDir, "test", b1.String(), ChunksDirname), os.ModePerm))
testutil.Ok(t, cpy(path.Join(tmpDir, b1.String(), ChunksDirname, "000001"), path.Join(tmpDir, "test", b1.String(), ChunksDirname, "000001")))
testutil.Copy(t, path.Join(tmpDir, b1.String(), ChunksDirname, "000001"), path.Join(tmpDir, "test", b1.String(), ChunksDirname, "000001"))
{
// Missing index file.
err := Upload(ctx, log.NewNopLogger(), bkt, path.Join(tmpDir, "test", b1.String()))
@@ -125,7 +123,7 @@ func TestUpload(t *testing.T) {
// Only debug meta.json present.
testutil.Equals(t, 1, len(bkt.Objects()))
}
testutil.Ok(t, cpy(path.Join(tmpDir, b1.String(), IndexFilename), path.Join(tmpDir, "test", b1.String(), IndexFilename)))
testutil.Copy(t, path.Join(tmpDir, b1.String(), IndexFilename), path.Join(tmpDir, "test", b1.String(), IndexFilename))
testutil.Ok(t, os.Remove(path.Join(tmpDir, "test", b1.String(), MetaFilename)))
{
// Missing meta.json file.
@@ -136,7 +134,7 @@ func TestUpload(t *testing.T) {
// Only debug meta.json present.
testutil.Equals(t, 1, len(bkt.Objects()))
}
testutil.Ok(t, cpy(path.Join(tmpDir, b1.String(), MetaFilename), path.Join(tmpDir, "test", b1.String(), MetaFilename)))
testutil.Copy(t, path.Join(tmpDir, b1.String(), MetaFilename), path.Join(tmpDir, "test", b1.String(), MetaFilename))
{
// Full block.
testutil.Ok(t, Upload(ctx, log.NewNopLogger(), bkt, path.Join(tmpDir, "test", b1.String())))
@@ -170,31 +168,6 @@ }
}
}

func cpy(src, dst string) error {
sourceFileStat, err := os.Stat(src)
if err != nil {
return err
}

if !sourceFileStat.Mode().IsRegular() {
return errors.Errorf("%s is not a regular file", src)
}

source, err := os.Open(src)
if err != nil {
return err
}
defer source.Close()

destination, err := os.Create(dst)
if err != nil {
return err
}
defer destination.Close()
_, err = io.Copy(destination, source)
return err
}

func TestDelete(t *testing.T) {
defer leaktest.CheckTimeout(t, 10*time.Second)()

