Store-gateway: inefficient chunks fetching and caching #3939

Closed
pracucci opened this issue Jan 12, 2023 · 3 comments · Fixed by #6017

Comments

@pracucci
Collaborator

pracucci commented Jan 12, 2023

I analysed the bytes touched vs fetched in some of our production Mimir clusters over the last 24h. I used the metrics cortex_bucket_store_series_data_size_touched_bytes_sum and cortex_bucket_store_series_data_size_fetched_bytes_sum, and got the following (numbers are GBs):

[Screenshot: touched vs fetched GBs by data type]

  • Touched: GBs processed by data type
  • Fetched: GBs fetched from the object storage by data type (doesn't include data fetched from the cache, except for chunks)

Ideally, we would expect Fetched to be less than Touched, because some data is served from the cache (series and postings), and similarly for chunks. That's true for "series" and "postings", but not for "chunks": we touched 58TB of chunks, but fetched 504TB (9x more).

Why the Touched vs Fetched discrepancy?

First of all, we need to understand how the metric is tracked for chunks:

  • Touched: computed on the actual chunk length (code)
  • Fetched: computed on the actual byte ranges we fetch from the object storage (code)

Why do we fetch 9x more than the actual chunk sizes? This is the effect of two different implementation details (a rough sketch of both follows the list):

  1. When looking up the index, we don't know the chunk length, so we assume the worst case scenario of 16000 bytes per chunk (here and here).
  2. In order to reduce the number of API calls to the object storage, we use a partitioner to aggregate close (but not adjacent) byte ranges into the same GET request (code). Basically, we intentionally over-read from the object storage.
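
To make both effects concrete, here's a minimal Go sketch (illustrative only, not the actual store-gateway code; the constant values, names and types are made up for the example, and chunk offsets are assumed to be already sorted as they come from the index):

type byteRange struct{ start, end int64 }

const (
	worstCaseChunkLen = 16000      // assumed chunk length, since the index doesn't store it
	maxGapSize        = 512 * 1024 // gap threshold below which nearby ranges are merged
)

// chunkRanges builds one byte range per chunk from the chunk start offsets in
// the index, assuming the worst-case length for every chunk (effect 1).
func chunkRanges(offsets []int64) []byteRange {
	ranges := make([]byteRange, 0, len(offsets))
	for _, off := range offsets {
		ranges = append(ranges, byteRange{start: off, end: off + worstCaseChunkLen})
	}
	return ranges
}

// partition merges ranges separated by less than maxGapSize, so fewer (but
// larger) GetRange calls are issued, intentionally over-reading the gaps
// (effect 2).
func partition(ranges []byteRange) []byteRange {
	var merged []byteRange
	for _, r := range ranges {
		if n := len(merged); n > 0 && r.start-merged[n-1].end <= maxGapSize {
			if r.end > merged[n-1].end {
				merged[n-1].end = r.end
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}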

Why does this affect the cache too?

Chunk caching is implemented as a wrapper around the object storage client. We don't cache individual chunks; we cache portions of the objects containing chunks (the "TSDB segment files"). Each cached portion is 16KB.

The current logic is as follows:

  1. Run the partitioner, extending the ranges to fetch
  2. For each partition, call object storage GetRange()
    1. The "caching bucket" lookups from the cache all 16KB portions within the requested range
    2. The "caching bucket" delegates to the underlying object storage client to read missing ranges from the object storage

Since the cache lookup happens after the partitioner, we end up reading from memcached even ranges that we don't actually need and that will then be discarded.
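
To make the 16KB alignment concrete, here's an illustrative sketch (not the actual caching bucket code; subrangeSize and the function name are made up):

const subrangeSize = 16 * 1024

type subrange struct{ start, end int64 }

// cacheSubranges returns the 16KB-aligned portions of a segment file that
// cover the requested [start, end) byte range. Every portion is looked up in
// the cache (and possibly fetched), even if only a few bytes of it are needed.
func cacheSubranges(start, end int64) []subrange {
	alignedStart := (start / subrangeSize) * subrangeSize
	var portions []subrange
	for off := alignedStart; off < end; off += subrangeSize {
		portions = append(portions, subrange{start: off, end: off + subrangeSize})
	}
	return portions
}

For example, cacheSubranges(16300, 16500) (a 200 byte chunk straddling a portion boundary) touches two full portions, i.e. 32KB read from the cache for 200 bytes of chunk data.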

How much does the partitioner over-read?

Querying the partitioner metrics, we can see how much it over-reads:

[Screenshot: partitioner requested vs expanded bytes]

We requested 323TB and the partitioner expanded it to 521TB. Looking at this data, the partitioner is over-reading by a factor of 1.6x, which is not that bad.

Why is the partitioner effect 1.6x, but fetched vs touched bytes 9x?

My theory is that the main reason is that we compute the initial chunk ranges based on the worst case of 16000 bytes per chunk (since we don't know the actual chunk length).

We know that on average a sample is 1.4 bytes. A chunk is on average 120 samples, so 120 * 1.4 = 168 bytes, which is far from the worst case scenario we assume.

We also know that the average scrape interval across all our customers is 30s. Assuming chunks are consecutive in the segment file and we query 24h blocks (the largest block compacted by Mimir), all chunks of a series fit in a range of (86400 / 30) * 1.4 = ~4KB, which is still 4x smaller than the minimum 16KB range we fetch from the cache.
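
Restating that arithmetic with the averages quoted above (just the numbers from this issue, not values measured in code):

const (
	avgBytesPerSample  = 1.4          // average sample size
	avgSamplesPerChunk = 120          // average samples per chunk
	avgScrapeInterval  = 30           // seconds
	blockRange         = 24 * 60 * 60 // 24h block, in seconds
	cachePortionSize   = 16 * 1024    // minimum range fetched from the cache
)

var (
	avgChunkSize   = avgSamplesPerChunk * avgBytesPerSample               // ≈ 168 B, vs the 16000 B worst case
	perSeriesBytes = (blockRange / avgScrapeInterval) * avgBytesPerSample // ≈ 4032 B ≈ 4KB of chunks per series per block
	minOverRead    = cachePortionSize / perSeriesBytes                    // ≈ 4x from the 16KB minimum alone
)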

Adding up the two issues, plus the fact that cached portions are also aligned to 16KB boundaries, the math gets quite close to a 9x over-reading inefficiency.

Reference queries

GBs touched:

sum by(data_type) (increase(cortex_bucket_store_series_data_size_touched_bytes_sum{namespace=~"..."}[1d]))
/ 1024^3

GBs fetched:

sum by(data_type) (increase(cortex_bucket_store_series_data_size_fetched_bytes_sum{namespace=~"..."}[1d]))
/ 1024^3

Partitioner's requested bytes total:

sum (increase(cortex_bucket_store_partitioner_requested_bytes_total{namespace=~"..."}[1d]))

Partitioner's expanded bytes total:

sum (increase(cortex_bucket_store_partitioner_expanded_bytes_total{namespace=~"..."}[1d]))
@pstibrany
Member

Chunk size frequency from a random 24h block from our monitoring cluster (30 segment files out of 249 in the block, around 95M chunks):

$ tsdb-chunks 00* | mygrep -r '$1' -m -nh 'length: (\d+) ' | promfreq -mode=exponential -factor=2 -count=15

      (-∞ .. 1] ▏ 0 (0.0 %)
       (1 .. 2] ▏ 0 (0.0 %)
       (2 .. 4] ▏ 0 (0.0 %)
       (4 .. 8] ▏ 0 (0.0 %)
      (8 .. 16] ▏ 0 (0.0 %)
     (16 .. 32] ▊ 745401 (0.8 %)
     (32 .. 64] █████████████████████████▊ 29349226 (30.8 %)
    (64 .. 128] ██████████████████████████████▏ 34301358 (35.9 %)
   (128 .. 256] ███████████▏ 12700317 (13.3 %)
   (256 .. 512] ████████████▍ 14075654 (14.7 %)
  (512 .. 1024] ███▊ 4228234 (4.4 %)
 (1024 .. 2048] ▏ 41523 (0.0 %)
 (2048 .. 4096] ▏ 32 (0.0 %)
 (4096 .. 8192] ▏ 0 (0.0 %)
(8192 .. 16384] ▏ 0 (0.0 %)
  (16384 .. +∞) ▏ 0 (0.0 %)

summary:
 count=95441745, p50=96.88731927173262, p90=416.0724723696675, p95=502.86448473371115, p99=913.4608069468243, avg=156.80059002483662, min=17, max=2134

@dimitarvdimitrov
Contributor

In an upcoming PR I'll be using the chunk's actual length when fetching chunks from the bucket. For a series, we can calculate the size of each chunk as the difference between the chunk refs of two consecutive chunks. This relies on the assumption that chunks in the segment files are ordered by the series they belong to. It works for all chunks except the last chunk of a series, whose length we still have to estimate. (Technically we could look at the chunk ref of the first chunk of the next series, but that's difficult right now.)
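
A simplified sketch of the idea (not the actual PR code; the types and the estimateLast callback are made up for illustration):

type chunkInfo struct {
	offset int64 // start offset within the TSDB segment file, derived from the chunk ref
	length int64 // 0 until computed or estimated
}

// computeLengths derives each chunk's length as the distance to the next
// chunk's offset. The last chunk has no successor, so its length has to be
// estimated based on the largest length we know for certain.
func computeLengths(chunks []chunkInfo, estimateLast func(maxKnown int64) int64) {
	var maxKnown int64
	for i := 0; i+1 < len(chunks); i++ {
		chunks[i].length = chunks[i+1].offset - chunks[i].offset
		if chunks[i].length > maxKnown {
			maxKnown = chunks[i].length
		}
	}
	if len(chunks) > 0 {
		chunks[len(chunks)-1].length = estimateLast(maxKnown)
	}
}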

A naive estimation

For this I initially tried an estimate of 2x the size of the largest chunk whose size we know for certain. This didn't work for cases where we have very few chunks (e.g. with high churn we might have one or two chunks per series) or where the chunks were sparse and covered very different time ranges (30s vs 30m).

Using a 2x estimate caused an underestimation for 0.6% of series in some requests. This increased the number of requests to the object store by 25% (since we had to refetch those series individually and couldn't group them in the same request with the help of the partitioner). Latency was also negatively affected.

[Screenshots: underestimation rate, object store request rate, and latency. zone-a was running these estimations and my upcoming changes; zones b and c were running the `main` implementation.]
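
For illustration, that naive estimator plugged into the sketch above would just be (made-up name):

// Naive last-chunk estimate: 2x the largest chunk whose size we know for certain.
// With only one chunk per series there is no known chunk at all (maxKnown == 0),
// which is part of why this broke down.
func naiveEstimateLast(maxKnown int64) int64 {
	return 2 * maxKnown
}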

Last chunk size analysis

This prompted me to look for a better estimation of chunk sizes.

What I wanted to find out was: given the size of the max chunk of a series (excluding the last chunk), how big will the last chunk be? Using the block index, I ran an analysis similar to Peter's on the size of chunks in a few blocks. I bucketed the size of the max chunk into powers of two and calculated percentiles for the size of the last chunk (in my analysis I almost always knew the size of the last chunk).

TL;DR: there is a lot more variability in the last chunk size when the max known chunk is small (<256 B). For series where the max chunk size was big (up to 4096 B), variability is lower and the last chunk was always below 2500 B.

__Results__

I analysed the chunks of 63M series from 6 blocks from different tenants. I wrote this tool to read and parse the index from the bucket, and then wrote a small Go program to parse the output, bucket the results, and calculate percentiles.

Terms:
* max chunk size: the size of the largest chunk in a timeseries, _excluding_ the last chunk
* last chunk size: actual size of the last chunk

Explanation:
    128 (n=0, 0%)
        N/A
        ^
        |
        # There were no series where the size of the max chunk was <=128

    256 (n=8229338, 13.0%)
        ^
        |
        |
        # 13% (or 8229338) of series in the input had a max chunk size in (128, 256]

        p10.000   0.749035    194 B  (9.0909%)
        ^         ^           ^       ^
        |         |           |       |
        # percentile of series where the max chunk was <=256B (p10)
                  |           |       |
                  # In the bottom 10% of series where max chunk was <=256B the last chunk was 0.749x that of the max chunk (sorted by ratio)
                              |       |
                              # In the bottom 10% of series where max chunk was <=256B the last chunk was 194 B (sorted by last chunk)
                                      |
                                      # Series with last chunk larger than 0.749035x or 194 B as % of all series in the input.

16 (n=0, 0.0%)
	N/A
32 (n=3888565, 6.2%)
	p10.0000  0.896552     23 B	(5.54874%)
	p50.0000  1.000000     28 B	(3.08263%)
	p90.0000  1.793103     48 B	(0.61653%)
	p99.0000  6.406250    171 B	(0.06165%)
	p99.9000  17.392857    463 B	(0.00617%)
	p99.9900  26.347826    744 B	(0.00062%)
	p99.9990  38.652176    923 B	(0.00006%)
	p99.9999  54.000000   1272 B	(0.00001%)
64 (n=9939473, 15.8%)
	p10.0000  0.650000     31 B	(14.18300%)
	p50.0000  0.954545     54 B	(7.87945%)
	p90.0000  1.206897     63 B	(1.57589%)
	p99.0000  5.293103    253 B	(0.15759%)
	p99.9000  9.615385    483 B	(0.01576%)
	p99.9900  20.027779    780 B	(0.00158%)
	p99.9990  31.297297   1210 B	(0.00016%)
	p99.9999  37.676472   1363 B	(0.00002%)
128 (n=21282638, 33.7%)
	p10.0000  0.472973     41 B	(30.36898%)
	p50.0000  0.757282     61 B	(16.87166%)
	p90.0000  0.985075    102 B	(3.37433%)
	p99.0000  1.972973    190 B	(0.33743%)
	p99.9000  6.364407    517 B	(0.03374%)
	p99.9900  9.296610    942 B	(0.00338%)
	p99.9990  11.655556   1094 B	(0.00034%)
	p99.9999  17.142857   1213 B	(0.00003%)
256 (n=14146232, 22.4%)
	p10.0000  0.308140     53 B	(20.18578%)
	p50.0000  0.712121    113 B	(11.21432%)
	p90.0000  0.964286    192 B	(2.24287%)
	p99.0000  1.300752    255 B	(0.22429%)
	p99.9000  3.682609    640 B	(0.02243%)
	p99.9900  5.264516    944 B	(0.00224%)
	p99.9990  7.775194   1100 B	(0.00023%)
	p99.9999  8.638462   1741 B	(0.00002%)
512 (n=8905849, 14.1%)
	p10.0000  0.224319     77 B	(12.70809%)
	p50.0000  0.816327    273 B	(7.06005%)
	p90.0000  0.993528    400 B	(1.41201%)
	p99.0000  1.113839    492 B	(0.14120%)
	p99.9000  1.883721    701 B	(0.01412%)
	p99.9900  3.221402    959 B	(0.00141%)
	p99.9990  4.165385   1443 B	(0.00014%)
	p99.9999  7.254613   1975 B	(0.00001%)
1024 (n=4199038, 6.7%)
	p10.0000  0.238195    162 B	(5.99176%)
	p50.0000  0.880510    540 B	(3.32876%)
	p90.0000  0.990826    875 B	(0.66575%)
	p99.0000  1.070946   1006 B	(0.06658%)
	p99.9000  1.514925   1111 B	(0.00666%)
	p99.9900  2.021277   1567 B	(0.00067%)
	p99.9990  2.488916   1990 B	(0.00007%)
	p99.9999  3.847195   2095 B	(0.00001%)
2048 (n=655889, 1.0%)
	p10.0000  0.197719    262 B	(0.93591%)
	p50.0000  0.676252    901 B	(0.51995%)
	p90.0000  0.974245   1236 B	(0.10399%)
	p99.0000  1.077839   1790 B	(0.01040%)
	p99.9000  1.489215   2046 B	(0.00104%)
	p99.9900  1.874067   2212 B	(0.00010%)
	p99.9990  1.970027   2416 B	(0.00001%)
	p99.9999  2.089918   2437 B	(0.00000%)
4096 (n=54047, 0.1%)
	p10.0000  0.218945    475 B	(0.07712%)
	p50.0000  0.532983   1171 B	(0.04285%)
	p90.0000  0.860074   1869 B	(0.00857%)
	p99.0000  1.006455   2230 B	(0.00086%)
	p99.9000  1.086239   2385 B	(0.00009%)
	p99.9900  1.136015   2437 B	(0.00001%)
	p99.9990  1.158329   2482 B	(0.00000%)
	p99.9999  1.158329   2482 B	(0.00000%)
+Inf (n=431, 0.0%)
	p10.0000  0.000000     55 B	(0.00062%)
	p50.0000  0.000000    299 B	(0.00034%)
	p90.0000  0.000000    876 B	(0.00007%)
	p99.0000  0.000000   1318 B	(0.00001%)
	p99.9000  0.128144   1783 B	(0.00000%)
	p99.9900  0.128144   1783 B	(0.00000%)
	p99.9990  0.128144   1783 B	(0.00000%)
	p99.9999  0.128144   1783 B	(0.00000%)

Making an estimation

How often this estimation can be below the actual size should be based on our cache hit rate, since we wouldn't use the estimation for already cached chunks.

In production at Grafana Labs we usually have a high hit rate for the chunks cache (96.3% over the last 30 days across all Mimir clusters).

If we want to have to refetch at most 1 batch in 10, then with a hit rate of 96.3% we can afford to underfetch (and refetch) the last chunk of 0.054% of series: 1 series / (10 batches x 5000 series per batch x (1 - 0.963) cache miss rate) = 0.00054. So we need to overestimate 99.946% of last chunks.

The following estimations would satisfy this for each max chunk size bucket:

32 (n=3888565, 6.2%)
	p99.9000  17.392857    463 B (0.00617%)
64 (n=9939473, 15.8%)
	p99.9000  9.615385    483 B	(0.01576%)
128 (n=21282638, 33.7%)
	p99.9900  9.296610    942 B	(0.00338%)
256 (n=14146232, 22.4%)
	p99.9900  5.264516    944 B	(0.00224%)
512 (n=8905849, 14.1%)
	p99.9000  1.883721    701 B	(0.01412%)
1024 (n=4199038, 6.7%)
	p99.9000  1.514925   1111 B	(0.00666%)
2048 (n=655889, 1.0%)
	p99.9000  1.489215   2046 B	(0.00104%)
4096 (n=54047, 0.1%)
	p99.9999  1.158329   2482 B	(0.00000%)

These estimations would cover 99.95063% of series.
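
For illustration, a fixed-size estimator derived from this table could look like the following (hypothetical code, not from the PR; the returned values are the percentile figures above):

// estimateLastChunkSize picks a fixed last-chunk estimate per max-known-chunk
// bucket, using the percentiles from the analysis above.
func estimateLastChunkSize(maxKnownChunkSize int64) int64 {
	switch {
	case maxKnownChunkSize <= 32:
		return 463
	case maxKnownChunkSize <= 64:
		return 483
	case maxKnownChunkSize <= 128:
		return 942
	case maxKnownChunkSize <= 256:
		return 944
	case maxKnownChunkSize <= 512:
		return 701
	case maxKnownChunkSize <= 1024:
		return 1111
	case maxKnownChunkSize <= 2048:
		return 2046
	default:
		return 2482
	}
}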

Ratio vs fixed size

I'm not sure whether to do the estimation with a fixed size for each bucket or to use a multiplier of the actual max chunk size on a per-series basis. The worst case is worse when using a ratio, but I'm not sure about the average case. I'm tempted to go with the fixed size since it looks simpler.

@dimitarvdimitrov
Contributor

Here's a long-overdue update: we've been running the fine-grained chunks changes at Grafana Labs for some time in February and March, and more recently trialing them in a few clusters.

Unfortunately, we didn't see significant improvements in most Mimir clusters. In most cases it led to a single-digit percentage decrease in in-use heap at the cost of 20-50% more object store operations and 10-20% higher latency. There was one cluster where the decrease in heap was ~25%, but that wasn't enough to justify keeping the feature.

So we decided to start removing the feature.
