
[dbnode] Use ref to segment data for index results instead of alloc each #1839

Merged — 38 commits merged into master from r/direct-refs-to-index-data on Oct 28, 2019

Conversation

robskillington
Collaborator

What this PR does / why we need it:

This greatly reduces memory allocations when querying the index: the segment's lifetime is guaranteed to survive the query, so results can return references to the index data rather than allocating a copy of each entry.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE
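For readers trying to piece the lifecycle together, here is a minimal sketch of the idea. BlockingClose mirrors the call seen in the diff below; the other names (queryContext, RegisterCloser, segment) are illustrative only, not the actual M3DB API:

```go
package main

import (
	"fmt"
	"io"
)

// segment owns a block of index data; query results reference this memory
// directly instead of copying each document out of it.
type segment struct {
	docs [][]byte
}

func (s *segment) Docs() [][]byte { return s.docs }
func (s *segment) Close() error   { s.docs = nil; return nil }

// queryContext collects closers whose lifetime must outlive the query call:
// it is only closed once the caller has finished reading the results.
type queryContext struct {
	closers []io.Closer
}

func (c *queryContext) RegisterCloser(cl io.Closer) {
	c.closers = append(c.closers, cl)
}

// BlockingClose finalizes everything registered on the context.
func (c *queryContext) BlockingClose() {
	for _, cl := range c.closers {
		_ = cl.Close()
	}
	c.closers = nil
}

// query returns references into the segment's data; the segment stays alive
// because it is registered on the context, not because anything was copied.
func query(ctx *queryContext, seg *segment) [][]byte {
	ctx.RegisterCloser(seg)
	return seg.Docs()
}

func main() {
	ctx := &queryContext{}
	seg := &segment{docs: [][]byte{[]byte("series-a"), []byte("series-b")}}

	results := query(ctx, seg) // zero-copy: slices alias the segment's memory
	for _, d := range results {
		fmt.Println(string(d))
	}

	ctx.BlockingClose() // segment is finalized only after results are consumed
}
```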

@codecov

codecov bot commented Jul 28, 2019

Codecov Report

Merging #1839 into master will decrease coverage by 9.2%.
The diff coverage is 73.4%.


@@            Coverage Diff             @@
##           master    #1839      +/-   ##
==========================================
- Coverage    72.8%    63.6%    -9.3%     
==========================================
  Files        1006     1123     +117     
  Lines       86671   107039   +20368     
==========================================
+ Hits        63099    68079    +4980     
- Misses      19360    34603   +15243     
- Partials     4212     4357     +145
| Flag | Coverage Δ |
|------|------------|
| #aggregator | 79.5% <ø> (+16.4%) ⬆️ |
| #cluster | 56.4% <ø> (+13.3%) ⬆️ |
| #collector | 63.7% <ø> (+22.4%) ⬆️ |
| #dbnode | 64.9% <72.9%> (+3.8%) ⬆️ |
| #m3em | 59.6% <ø> (-1.8%) ⬇️ |
| #m3ninx | 61.2% <96.4%> (-6.6%) ⬇️ |
| #m3nsch | 51.1% <ø> (-24.4%) ⬇️ |
| #metrics | 17.7% <ø> (-11.7%) ⬇️ |
| #msg | 74.7% <ø> (-0.3%) ⬇️ |
| #query | 68.6% <ø> (+17.1%) ⬆️ |
| #x | 74.7% <71.3%> (+0.3%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9baa7da...cb0ebfb

@robskillington changed the title from "[dbnode] Use ref to segment data for index results instead of alloc each" to "WIP [dbnode] Use ref to segment data for index results instead of alloc each" on Jul 28, 2019
@robskillington changed the title from "WIP [dbnode] Use ref to segment data for index results instead of alloc each" to "[dbnode] Use ref to segment data for index results instead of alloc each" on Jul 29, 2019
@richardartoul (Contributor) left a comment

This looks OK to me, although it would be nice if there were a place where you documented this entire lifecycle end-to-end, so we don't have to piece it together manually the next time we need to understand it.

zap.Error(err),
)
}
// if err := bsGauge.UpdateStringList(bootstrappers); err != nil {
Contributor

did you mean to disable this?

Collaborator (author)

Yeah, this is meant to be re-enabled, but it was crashing and I needed to test; I'll re-enable it when it's fixed on master.

docsPool.Put(batch)
}()

// Register the executor to close when context closes
// so can copy the results.
Contributor

This comment is a little confusing. I would assume you'd want to register it to the context so you don't have to copy it. Could you maybe clarify?

Collaborator (author)

Correct, I think I meant to say "so we can avoid copying the results". Honestly, I hadn't cleaned all this up yet.
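For illustration, here is the difference in the pattern being discussed, reusing the queryContext type from the sketch in the PR description above; resultsExecutor is a hypothetical stand-in for the search executor, not the real type:

```go
// resultsExecutor stands in for the search executor: it owns the buffers
// backing its results until Close is called.
type resultsExecutor struct {
	results [][]byte
}

func (e *resultsExecutor) Results() [][]byte { return e.results }
func (e *resultsExecutor) Close() error      { e.results = nil; return nil }

// Before: closing the executor before returning forces a copy of each result.
func queryCopying(exec *resultsExecutor) [][]byte {
	defer exec.Close()
	out := make([][]byte, 0, len(exec.Results()))
	for _, r := range exec.Results() {
		out = append(out, append([]byte(nil), r...)) // copy out of executor-owned memory
	}
	return out
}

// After: the executor's Close is registered on the query context instead, so
// the caller can keep reading executor-owned memory until the context closes.
func queryZeroCopy(ctx *queryContext, exec *resultsExecutor) [][]byte {
	ctx.RegisterCloser(exec)
	return exec.Results() // references only, no copies
}
```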

@@ -865,10 +871,6 @@ func (b *block) queryWithSpan(
return false, err
}

if err := execCloser.Close(); err != nil {
Contributor

lol interesting, did we have a double close?

Collaborator (author)

Oh nah, this got moved to the context, which closes it when the context itself closes. (SafeCloser wraps the executor's close call and prevents it from being double closed, so having both calls wasn't an issue.)
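For context, a "safe closer" is typically just a wrapper that makes Close idempotent. A minimal sketch of that idea (not M3's actual implementation):

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// safeCloser makes an io.Closer idempotent: the wrapped Close runs at most
// once, so closing via both a defer and a context finalizer is harmless.
type safeCloser struct {
	io.Closer
	once sync.Once
	err  error
}

func (c *safeCloser) Close() error {
	c.once.Do(func() { c.err = c.Closer.Close() })
	return c.err
}

type noisyCloser struct{}

func (noisyCloser) Close() error {
	fmt.Println("closed")
	return nil
}

func main() {
	c := &safeCloser{Closer: noisyCloser{}}
	_ = c.Close()
	_ = c.Close() // no-op: "closed" prints only once
}
```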

// the tsID's bytes.
r.resultsMap.Set(tsID, tags)
// It is assumed that the document is valid for the lifetime of the index
// results.
Contributor

Can this comment somehow explain the context lifecycle, or point to an explanation of it?

Collaborator (author)

Yeah, I'm trying to find the best place to document that.

}, nil
}

// NB(r): The segment uses the context finalization to finalize
Contributor

This might be a good place to explain the general approach and how it interacts with outstanding queries

Collaborator (author)

Sure yeah.
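As an illustration of how context finalization interacts with outstanding queries, here is a deliberately simplified sketch (not the actual block/segment code): each query takes a reference on the segment via its context, and a blocking close only releases the underlying data once every outstanding reference has been returned.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// refCountedSegment is a simplified stand-in for an index segment whose data
// must stay valid while any outstanding query result still references it.
type refCountedSegment struct {
	wg   sync.WaitGroup
	data []byte
}

// incRef is taken when a query context registers the segment for its results.
func (s *refCountedSegment) incRef() { s.wg.Add(1) }

// decRef runs from the context's finalizer once the results are released.
func (s *refCountedSegment) decRef() { s.wg.Done() }

// closeBlocking waits for outstanding queries before releasing the data,
// mirroring the blocking-close behaviour this PR relies on.
func (s *refCountedSegment) closeBlocking() {
	s.wg.Wait()
	s.data = nil
	fmt.Println("segment finalized")
}

func main() {
	seg := &refCountedSegment{data: []byte("index data")}

	seg.incRef() // a query's results reference seg.data
	go func() {
		time.Sleep(10 * time.Millisecond) // caller reads the results...
		seg.decRef()                      // ...then the query context closes
	}()

	seg.closeBlocking() // blocks until the outstanding query has finished
}
```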

@mstump

mstump commented Aug 5, 2019

Is there a container with this patch available on a public repo? I want to test it out.

@robskillington
Collaborator (author)

@mstump about to merge and cut a release in the next few hours 👍

@robskillington
Collaborator (author)

@mstump let me just build a container for this change.

@robskillington
Collaborator (author)

@mstump here's a container for this branch:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy

Manifest:
https://quay.io/repository/m3db/m3dbnode/manifest/sha256:c974cab45e7da0e7547dc46e84e9988eb8c113e2129566bd129cc3eb6027ee70

-func (enc *encoder) Stream(opts encoding.StreamOptions) (xio.SegmentReader, bool) {
-	segment := enc.segment(byCopyResultType)
+func (enc *encoder) Stream(
+	ctx context.Context,
Contributor

Take it or leave it, but it could be nice to make the zero-copy behavior optional via encoding.StreamOptions and put the context in there too, so that we don't have to pollute this API (and create a context every time we want to do this, even when we don't need zero copy).
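A sketch of the API shape being suggested here, with stub types standing in for encoding.StreamOptions, xio.SegmentReader, and the encoder; the Context and ZeroCopy fields are the suggestion being made, not the merged API:

```go
package main

import (
	"context"
	"fmt"
)

// segmentReader stands in for xio.SegmentReader from the diff above.
type segmentReader struct {
	data     []byte
	zeroCopy bool
}

// StreamOptions is the shape being suggested: carry the context and a
// zero-copy flag in the options so Stream's signature stays stable.
// context.Context is only a stand-in here for M3's own context type.
type StreamOptions struct {
	Context  context.Context // owns the data's lifetime when ZeroCopy is set
	ZeroCopy bool
}

type encoder struct {
	buf []byte
}

// Stream returns the encoder's data either by reference (zero-copy, valid
// until the context is finalized) or as an independent copy.
func (enc *encoder) Stream(opts StreamOptions) (segmentReader, bool) {
	if opts.ZeroCopy && opts.Context != nil {
		return segmentReader{data: enc.buf, zeroCopy: true}, true
	}
	cp := append([]byte(nil), enc.buf...)
	return segmentReader{data: cp}, true
}

func main() {
	enc := &encoder{buf: []byte("encoded block")}

	copied, _ := enc.Stream(StreamOptions{})
	ref, _ := enc.Stream(StreamOptions{Context: context.Background(), ZeroCopy: true})

	fmt.Println(string(copied.data), copied.zeroCopy) // encoded block false
	fmt.Println(string(ref.data), ref.zeroCopy)       // encoded block true
}
```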

@@ -229,23 +230,24 @@ func (m *merger) Merge(
// Closing the context will finalize the data returned from
// mergeWith.Read(), but is safe because it has already been persisted
// to disk.
-	tmpCtx.BlockingClose()
+	// NB(r): Make sure to use BlockingCloseReset so can reuse the context.
Contributor

Isn't the reset handled by the reset right before the call to mergeWith.Read()?

Collaborator (author)

Yeah, the main thing we avoid by using BlockingCloseReset() is the context going back to the pool (i.e. the Reset suffix means: after closing, do not put it back in the pool; we explicitly want to reuse this context).

And the main reason we need a pooled context here is that the context needs to refer to the pooled finalizers linked list now, or else it would allocate a ton between uses in this tight loop.
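A simplified sketch of the close-versus-close-and-reset distinction being described; the pooling and finalizer details are illustrative, and the real context also blocks on outstanding readers, which is omitted here:

```go
package main

import "fmt"

type finalizer func()

// poolBackedContext is a simplified stand-in for M3's pooled context: the
// finalizers slice keeps its capacity across uses, so reusing one context in
// a tight loop avoids re-allocating finalizer storage every iteration.
type poolBackedContext struct {
	finalizers []finalizer
	pool       *contextPool
}

type contextPool struct{ free []*poolBackedContext }

func (p *contextPool) Get() *poolBackedContext {
	if n := len(p.free); n > 0 {
		ctx := p.free[n-1]
		p.free = p.free[:n-1]
		return ctx
	}
	return &poolBackedContext{pool: p}
}

func (c *poolBackedContext) RegisterFinalizer(f finalizer) {
	c.finalizers = append(c.finalizers, f)
}

func (c *poolBackedContext) runFinalizers() {
	for _, f := range c.finalizers {
		f()
	}
	c.finalizers = c.finalizers[:0] // keep capacity for the next use
}

// BlockingClose finalizes and returns the context to the pool; the caller
// must not reuse it afterwards.
func (c *poolBackedContext) BlockingClose() {
	c.runFinalizers()
	c.pool.free = append(c.pool.free, c)
}

// BlockingCloseReset finalizes but does NOT return the context to the pool,
// so the caller can keep reusing it, as in the merge loop discussed above.
func (c *poolBackedContext) BlockingCloseReset() {
	c.runFinalizers()
}

func main() {
	pool := &contextPool{}
	tmpCtx := pool.Get()

	for i := 0; i < 3; i++ {
		i := i
		tmpCtx.RegisterFinalizer(func() { fmt.Println("finalized merge data for iteration", i) })
		tmpCtx.BlockingCloseReset() // safe to reuse tmpCtx on the next iteration
	}

	tmpCtx.BlockingClose() // done with the loop: hand the context back to the pool
}
```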

@robskillington
Collaborator (author)

@mstump I published a new container with the latest changes for testing:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy-2

@mstump

mstump commented Aug 7, 2019

It looks like the container only has the coordinator binary and doesn't have m3dbnode

/ # find / -name 'm3*'
/bin/m3coordinator
/etc/m3coordinator
/etc/m3coordinator/m3coordinator.yml

@robskillington
Collaborator (author)

robskillington commented Aug 7, 2019

@mstump yeah, that is weird, I noticed that. I used the wrong Dockerfile, as I'd just prepared a coordinator image before.

Here's a working one:

quay.io/m3db/m3dbnode:index-and-recent-data-zero-copy-3

@mstump

mstump commented Aug 9, 2019

OK I got it deployed a day or two ago but didn't have time to test until just now.

  • It doesn't consume all available memory as was the case before the patch, but the node becomes unresponsive under load. The node begins to fail health checks and is restarted by Kubernetes, taking down the entire cluster. Prometheus stat pulls also fail.
  • CPU doesn't max out; I'm not sure where the bottleneck is.
  • Memory usage increases much more gradually than before and didn't reach the 50 GB limit.
  • After applying the patch, baseline CPU utilization increased from about 1.5 CPUs to 2.0 CPUs.

The workload is essentially CSV export for all metrics of a cluster at 1m granularity for a 60m time period. It's about 2-3k time series and the query looks like the following.

query_range?query=%7Bcluster_id%3D%22d752ae7a-6801-4a1b-9d00-7be6734ff409%22,+scope%3D%22CLUSTER%22%7D&step=60&end=1562814080&start=1562810480
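Decoded, that query is the label matcher {cluster_id="d752ae7a-6801-4a1b-9d00-7be6734ff409", scope="CLUSTER"} over a one-hour window (start=1562810480, end=1562814080) at a 60-second step. For reference, a sketch of issuing the same request from Go, assuming a coordinator serving the Prometheus-compatible query API at localhost:7201:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	params := url.Values{}
	params.Set("query", `{cluster_id="d752ae7a-6801-4a1b-9d00-7be6734ff409", scope="CLUSTER"}`)
	params.Set("step", "60")          // 1m resolution
	params.Set("start", "1562810480") // window start (unix seconds)
	params.Set("end", "1562814080")   // window end: start + 3600s = 60m

	// Host/port are an assumption (a locally reachable coordinator).
	u := "http://localhost:7201/api/v1/query_range?" + params.Encode()

	resp, err := http.Get(u)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("status=%d bytes=%d\n", resp.StatusCode, len(body))
}
```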

[Screenshot: screen cap from load test]

@mstump

mstump commented Aug 9, 2019

I should also mention that it's only 10 concurrent queries from this job. There is an additional baseline load resulting in 10k tagged RPC writes per second, and 6 analytical queries per second (7d queries over a 1h aggregated namespace) which are fronted by a dedicated query tier.

@mstump

mstump commented Aug 9, 2019

I repeated the test with a single thread of execution and it's still enough to cause nodes to go unresponsive.

@mstump

mstump commented Aug 9, 2019

With the single-concurrent-query test, I removed the load as soon as the nodes started to go unresponsive. The only message of note has to do with etcd. I checked etcd memory and CPU utilization during that time frame and it was nominal, maxing out at 67 MB of memory and 0.015 CPU.

{"level":"error","ts":1565373847.3873844,"msg":"received error on watch channel","error":"etcdserver: no leader"}
{"level":"info","ts":1565373852.6725218,"msg":"finished handling request","rqID":"2e3f3728-ad49-4043-8adf-ca08294a9f2e","time":1565373852.6725035,"response":5.296629915,"url":"/api/v1/placement"}
{"level":"warn","ts":1565373854.0264885,"msg":"etcd watch channel closed on key, recreating a watch channel","key":"_sd.placement/m3db/metrics-cluster/m3db"}

@richardartoul
Contributor

@mstump can't comment on the actual performance since this was Rob's work, but the etcd stuff is usually a red herring. It usually just means the node is under heavy load and, as a result, connections to etcd get broken and need to be re-established.

@notbdu
Contributor

notbdu commented Oct 9, 2019

Seeing a ~43% reduction in RSS memory when running a side-by-side comparison of the feature branch against the release.

Feature: [screenshot]
Release: [screenshot]

@notbdu force-pushed the r/direct-refs-to-index-data branch from 3674995 to 3e9511b on October 24, 2019 18:45
@robskillington robskillington merged commit e6d653b into master Oct 28, 2019
@robskillington robskillington deleted the r/direct-refs-to-index-data branch October 28, 2019 03:22