
[dbnode] Use ref to segment data for index results instead of alloc each #1839

Merged — 38 commits merged into master from r/direct-refs-to-index-data on Oct 28, 2019

Conversation

robskillington
Collaborator

What this PR does / why we need it:

This greatly reduces memory allocations when querying the index: the segment's lifetime is guaranteed to survive the query, so results can return references to the index data rather than allocating a copy of each entry.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE
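For readers trying to piece the lifecycle together, here is a minimal sketch of the idea. BlockingClose mirrors the call seen in the diff below; the other names (queryContext, RegisterCloser, segment) are illustrative only, not the actual M3DB API:

```go
package main

import (
	"fmt"
	"io"
)

// segment owns a block of index data; query results reference this memory
// directly instead of copying each document out of it.
type segment struct {
	docs [][]byte
}

func (s *segment) Docs() [][]byte { return s.docs }
func (s *segment) Close() error   { s.docs = nil; return nil }

// queryContext collects closers whose lifetime must outlive the query call:
// it is only closed once the caller has finished reading the results.
type queryContext struct {
	closers []io.Closer
}

func (c *queryContext) RegisterCloser(cl io.Closer) {
	c.closers = append(c.closers, cl)
}

// BlockingClose finalizes everything registered on the context.
func (c *queryContext) BlockingClose() {
	for _, cl := range c.closers {
		_ = cl.Close()
	}
	c.closers = nil
}

// query returns references into the segment's data; the segment stays alive
// because it is registered on the context, not because anything was copied.
func query(ctx *queryContext, seg *segment) [][]byte {
	ctx.RegisterCloser(seg)
	return seg.Docs()
}

func main() {
	ctx := &queryContext{}
	seg := &segment{docs: [][]byte{[]byte("series-a"), []byte("series-b")}}

	results := query(ctx, seg) // zero-copy: slices alias the segment's memory
	for _, d := range results {
		fmt.Println(string(d))
	}

	ctx.BlockingClose() // segment is finalized only after results are consumed
}
```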

@codecov

codecov bot commented Jul 28, 2019

Codecov Report

Merging #1839 into master will decrease coverage by 9.2%.
The diff coverage is 73.4%.


@@            Coverage Diff             @@
##           master    #1839      +/-   ##
==========================================
- Coverage    72.8%    63.6%    -9.3%     
==========================================
  Files        1006     1123     +117     
  Lines       86671   107039   +20368     
==========================================
+ Hits        63099    68079    +4980     
- Misses      19360    34603   +15243     
- Partials     4212     4357     +145
| Flag | Coverage Δ |
|------|------------|
| #aggregator | 79.5% <ø> (+16.4%) ⬆️ |
| #cluster | 56.4% <ø> (+13.3%) ⬆️ |
| #collector | 63.7% <ø> (+22.4%) ⬆️ |
| #dbnode | 64.9% <72.9%> (+3.8%) ⬆️ |
| #m3em | 59.6% <ø> (-1.8%) ⬇️ |
| #m3ninx | 61.2% <96.4%> (-6.6%) ⬇️ |
| #m3nsch | 51.1% <ø> (-24.4%) ⬇️ |
| #metrics | 17.7% <ø> (-11.7%) ⬇️ |
| #msg | 74.7% <ø> (-0.3%) ⬇️ |
| #query | 68.6% <ø> (+17.1%) ⬆️ |
| #x | 74.7% <71.3%> (+0.3%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9baa7da...cb0ebfb

@robskillington changed the title from "[dbnode] Use ref to segment data for index results instead of alloc each" to "WIP [dbnode] Use ref to segment data for index results instead of alloc each" on Jul 28, 2019
@robskillington changed the title from "WIP [dbnode] Use ref to segment data for index results instead of alloc each" to "[dbnode] Use ref to segment data for index results instead of alloc each" on Jul 29, 2019
@richardartoul (Contributor) left a comment

This looks OK to me, although it would be nice if there were a place where you documented this entire lifecycle end-to-end, so we don't have to piece it together manually the next time we need to understand it.

zap.Error(err),
)
}
// if err := bsGauge.UpdateStringList(bootstrappers); err != nil {
Contributor

did you mean to disable this?

Collaborator (author)

Yeah, this is meant to be re-enabled, but it was crashing and I needed to test; I'll re-enable it when it's fixed on master.

docsPool.Put(batch)
}()

// Register the executor to close when context closes
// so can copy the results.
Contributor

This comment is a little confusing. I would assume you'd want to register it to the context so you don't have to copy it. Could you maybe clarify?

Collaborator (author)

Correct, I think I meant to say "so we can avoid copying the results". Honestly, I hadn't cleaned all this up yet.
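For illustration, here is the difference in the pattern being discussed, reusing the queryContext type from the sketch in the PR description above; resultsExecutor is a hypothetical stand-in for the search executor, not the real type:

```go
// resultsExecutor stands in for the search executor: it owns the buffers
// backing its results until Close is called.
type resultsExecutor struct {
	results [][]byte
}

func (e *resultsExecutor) Results() [][]byte { return e.results }
func (e *resultsExecutor) Close() error      { e.results = nil; return nil }

// Before: closing the executor before returning forces a copy of each result.
func queryCopying(exec *resultsExecutor) [][]byte {
	defer exec.Close()
	out := make([][]byte, 0, len(exec.Results()))
	for _, r := range exec.Results() {
		out = append(out, append([]byte(nil), r...)) // copy out of executor-owned memory
	}
	return out
}

// After: the executor's Close is registered on the query context instead, so
// the caller can keep reading executor-owned memory until the context closes.
func queryZeroCopy(ctx *queryContext, exec *resultsExecutor) [][]byte {
	ctx.RegisterCloser(exec)
	return exec.Results() // references only, no copies
}
```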

@@ -865,10 +871,6 @@ func (b *block) queryWithSpan(
return false, err
}

if err := execCloser.Close(); err != nil {
Contributor

lol interesting, did we have a double close?

Collaborator (author)

Oh nah, this got moved to the context, which closes it when the context itself closes. (SafeCloser wraps the executor's close call and prevents it from being double closed, so having both calls wasn't an issue.)
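For context, a "safe closer" is typically just a wrapper that makes Close idempotent. A minimal sketch of that idea (not M3's actual implementation):

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// safeCloser makes an io.Closer idempotent: the wrapped Close runs at most
// once, so closing via both a defer and a context finalizer is harmless.
type safeCloser struct {
	io.Closer
	once sync.Once
	err  error
}

func (c *safeCloser) Close() error {
	c.once.Do(func() { c.err = c.Closer.Close() })
	return c.err
}

type noisyCloser struct{}

func (noisyCloser) Close() error {
	fmt.Println("closed")
	return nil
}

func main() {
	c := &safeCloser{Closer: noisyCloser{}}
	_ = c.Close()
	_ = c.Close() // no-op: "closed" prints only once
}
```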

// the tsID's bytes.
r.resultsMap.Set(tsID, tags)
// It is assumed that the document is valid for the lifetime of the index
// results.
Contributor

Can this comment somehow explain the context lifecycle, or point to an explanation of it?

Collaborator (author)

Yeah, I'm trying to find the best place to document that.

}, nil
}

// NB(r): The segment uses the context finalization to finalize
Contributor

This might be a good place to explain the general approach and how it interacts with outstanding queries

Collaborator (author)

Sure yeah.
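As an illustration of how context finalization interacts with outstanding queries, here is a deliberately simplified sketch (not the actual block/segment code): each query takes a reference on the segment via its context, and a blocking close only releases the underlying data once every outstanding reference has been returned.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// refCountedSegment is a simplified stand-in for an index segment whose data
// must stay valid while any outstanding query result still references it.
type refCountedSegment struct {
	wg   sync.WaitGroup
	data []byte
}

// incRef is taken when a query context registers the segment for its results.
func (s *refCountedSegment) incRef() { s.wg.Add(1) }

// decRef runs from the context's finalizer once the results are released.
func (s *refCountedSegment) decRef() { s.wg.Done() }

// closeBlocking waits for outstanding queries before releasing the data,
// mirroring the blocking-close behaviour this PR relies on.
func (s *refCountedSegment) closeBlocking() {
	s.wg.Wait()
	s.data = nil
	fmt.Println("segment finalized")
}

func main() {
	seg := &refCountedSegment{data: []byte("index data")}

	seg.incRef() // a query's results reference seg.data
	go func() {
		time.Sleep(10 * time.Millisecond) // caller reads the results...
		seg.decRef()                      // ...then the query context closes
	}()

	seg.closeBlocking() // blocks until the outstanding query has finished
}
```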

@mstump

mstump commented Aug 5, 2019

Is there a container with this patch available on a public repo? I want to test it out.

@robskillington
Collaborator (author)

@mstump about to merge and cut a release in the next few hours 👍

@robskillington
Collaborator (author)

@mstump let me just build a container for this change.

@robskillington
Collaborator (author)

@mstump here's a container for this branch:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy

Manifest:
https://quay.io/repository/m3db/m3dbnode/manifest/sha256:c974cab45e7da0e7547dc46e84e9988eb8c113e2129566bd129cc3eb6027ee70

-func (enc *encoder) Stream(opts encoding.StreamOptions) (xio.SegmentReader, bool) {
-	segment := enc.segment(byCopyResultType)
+func (enc *encoder) Stream(
+	ctx context.Context,
Contributor

Take it or leave it, but it could be nice to make the zero-copy behavior optional via encoding.StreamOptions and put the context in there too, so that we don't have to pollute this API (and create a context every time we want to do this, even when we don't need zero copy).
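A sketch of the API shape being suggested here, with stub types standing in for encoding.StreamOptions, xio.SegmentReader, and the encoder; the Context and ZeroCopy fields are the suggestion being made, not the merged API:

```go
package main

import (
	"context"
	"fmt"
)

// segmentReader stands in for xio.SegmentReader from the diff above.
type segmentReader struct {
	data     []byte
	zeroCopy bool
}

// StreamOptions is the shape being suggested: carry the context and a
// zero-copy flag in the options so Stream's signature stays stable.
// context.Context is only a stand-in here for M3's own context type.
type StreamOptions struct {
	Context  context.Context // owns the data's lifetime when ZeroCopy is set
	ZeroCopy bool
}

type encoder struct {
	buf []byte
}

// Stream returns the encoder's data either by reference (zero-copy, valid
// until the context is finalized) or as an independent copy.
func (enc *encoder) Stream(opts StreamOptions) (segmentReader, bool) {
	if opts.ZeroCopy && opts.Context != nil {
		return segmentReader{data: enc.buf, zeroCopy: true}, true
	}
	cp := append([]byte(nil), enc.buf...)
	return segmentReader{data: cp}, true
}

func main() {
	enc := &encoder{buf: []byte("encoded block")}

	copied, _ := enc.Stream(StreamOptions{})
	ref, _ := enc.Stream(StreamOptions{Context: context.Background(), ZeroCopy: true})

	fmt.Println(string(copied.data), copied.zeroCopy) // encoded block false
	fmt.Println(string(ref.data), ref.zeroCopy)       // encoded block true
}
```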

@@ -229,23 +230,24 @@ func (m *merger) Merge(
// Closing the context will finalize the data returned from
// mergeWith.Read(), but is safe because it has already been persisted
// to disk.
-	tmpCtx.BlockingClose()
+	// NB(r): Make sure to use BlockingCloseReset so can reuse the context.
Contributor

Isn't the reset handled by the reset right before the call to mergeWith.Read()?

Collaborator (author)

Yeah, the main thing we avoid by using BlockingCloseReset() is the context going back to the pool (i.e. the Reset suffix means: after closing, do not put it back in the pool; we explicitly want to reuse this context).

And the main reason we need a pooled context here is that the context needs to refer to the pooled finalizers linked list now, or else it would allocate a ton between uses in this tight loop.
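A simplified sketch of the close-versus-close-and-reset distinction being described; the pooling and finalizer details are illustrative, and the real context also blocks on outstanding readers, which is omitted here:

```go
package main

import "fmt"

type finalizer func()

// poolBackedContext is a simplified stand-in for M3's pooled context: the
// finalizers slice keeps its capacity across uses, so reusing one context in
// a tight loop avoids re-allocating finalizer storage every iteration.
type poolBackedContext struct {
	finalizers []finalizer
	pool       *contextPool
}

type contextPool struct{ free []*poolBackedContext }

func (p *contextPool) Get() *poolBackedContext {
	if n := len(p.free); n > 0 {
		ctx := p.free[n-1]
		p.free = p.free[:n-1]
		return ctx
	}
	return &poolBackedContext{pool: p}
}

func (c *poolBackedContext) RegisterFinalizer(f finalizer) {
	c.finalizers = append(c.finalizers, f)
}

func (c *poolBackedContext) runFinalizers() {
	for _, f := range c.finalizers {
		f()
	}
	c.finalizers = c.finalizers[:0] // keep capacity for the next use
}

// BlockingClose finalizes and returns the context to the pool; the caller
// must not reuse it afterwards.
func (c *poolBackedContext) BlockingClose() {
	c.runFinalizers()
	c.pool.free = append(c.pool.free, c)
}

// BlockingCloseReset finalizes but does NOT return the context to the pool,
// so the caller can keep reusing it, as in the merge loop discussed above.
func (c *poolBackedContext) BlockingCloseReset() {
	c.runFinalizers()
}

func main() {
	pool := &contextPool{}
	tmpCtx := pool.Get()

	for i := 0; i < 3; i++ {
		i := i
		tmpCtx.RegisterFinalizer(func() { fmt.Println("finalized merge data for iteration", i) })
		tmpCtx.BlockingCloseReset() // safe to reuse tmpCtx on the next iteration
	}

	tmpCtx.BlockingClose() // done with the loop: hand the context back to the pool
}
```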

@robskillington
Collaborator (author)

@mstump I published a new container with the latest changes for testing:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy-2

@mstump

mstump commented Aug 7, 2019

It looks like the container only has the coordinator binary and doesn't have m3dbnode

/ # find / -name 'm3*'
/bin/m3coordinator
/etc/m3coordinator
/etc/m3coordinator/m3coordinator.yml

@robskillington
Collaborator (author)

robskillington commented Aug 7, 2019

@mstump yeah, that is weird, I noticed that. I used the wrong Dockerfile, as I'd just prepared a coordinator image before.

Here's a working one:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy-3

@mstump

mstump commented Aug 9, 2019

OK I got it deployed a day or two ago but didn't have time to test until just now.

  • It doesn't consume all available memory as was the case before the patch, but the node becomes unresponsive under load. The node begins to fail health checks and is restarted by Kubernetes, taking down the entire cluster. Prometheus stat pulls also fail.
  • CPU doesn't max out; I'm not sure where the bottleneck is.
  • Memory usage increases much more gradually than before and didn't reach the 50 GB limit.
  • After applying the patch, baseline CPU utilization increased from about 1.5 CPUs to 2.0 CPUs.

The workload is essentially CSV export for all metrics of a cluster at 1m granularity for a 60m time period. It's about 2-3k time series and the query looks like the following.

query_range?query=%7Bcluster_id%3D%22d752ae7a-6801-4a1b-9d00-7be6734ff409%22,+scope%3D%22CLUSTER%22%7D&step=60&end=1562814080&start=1562810480
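Decoded, that query is the label matcher {cluster_id="d752ae7a-6801-4a1b-9d00-7be6734ff409", scope="CLUSTER"} over a one-hour window (start=1562810480, end=1562814080) at a 60-second step. For reference, a sketch of issuing the same request from Go, assuming a coordinator serving the Prometheus-compatible query API at localhost:7201:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Set("query", `{cluster_id="d752ae7a-6801-4a1b-9d00-7be6734ff409", scope="CLUSTER"}`)
	params.Set("step", "60")          // 1m resolution
	params.Set("start", "1562810480") // window start (unix seconds)
	params.Set("end", "1562814080")   // window end: start + 3600s = 60m

	// Host/port are an assumption (a locally reachable coordinator).
	u := "http://localhost:7201/api/v1/query_range?" + params.Encode()

	resp, err := http.Get(u)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status=%d bytes=%d\n", resp.StatusCode, len(body))
}
```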

[Screenshot: screen cap from load test]

@mstump

mstump commented Aug 9, 2019

I should also mention that it's only 10 concurrent queries from this job. There is an additional baseline load resulting in 10k tagged RPC writes per second, and 6 analytical queries per second (7d queries over a 1h aggregated namespace) which are fronted by a dedicated query tier.

@mstump

mstump commented Aug 9, 2019

I repeated the test with a single thread of execution and it's still enough to cause nodes to go unresponsive.

@mstump

mstump commented Aug 9, 2019

With the single-concurrent-query test, I removed the load as soon as the nodes started to go unresponsive. The only message of note has to do with etcd. I checked etcd memory and CPU utilization during that time frame and it was nominal, maxing out at 67 MB of memory and 0.015 CPU.

{"level":"error","ts":1565373847.3873844,"msg":"received error on watch channel","error":"etcdserver: no leader"}
{"level":"info","ts":1565373852.6725218,"msg":"finished handling request","rqID":"2e3f3728-ad49-4043-8adf-ca08294a9f2e","time":1565373852.6725035,"response":5.296629915,"url":"/api/v1/placement"}
{"level":"warn","ts":1565373854.0264885,"msg":"etcd watch channel closed on key, recreating a watch channel","key":"_sd.placement/m3db/metrics-cluster/m3db"}

@richardartoul
Contributor

@mstump can't comment on the actual performance since this was Rob's work, but the etcd stuff is usually a red herring. It usually just means the node is under heavy load and, as a result, connections to etcd get broken and need to be re-established.

@notbdu
Contributor

notbdu commented Oct 9, 2019

Seeing a ~43% reduction in RSS memory when running a side-by-side comparison of the feature branch against the release.

Feature: [screenshot]
Release: [screenshot]

@notbdu force-pushed the r/direct-refs-to-index-data branch from 3674995 to 3e9511b on October 24, 2019 18:45
@robskillington robskillington merged commit e6d653b into master Oct 28, 2019
@robskillington robskillington deleted the r/direct-refs-to-index-data branch October 28, 2019 03:22