
[aggregator] Raw TCP Client write queueing/buffering refactor #3342

Merged 13 commits into master from v/tcpclient on Mar 9, 2021

Conversation

@vdarulis (Collaborator) commented Mar 9, 2021

What this PR does / why we need it:

  • Fixes a long-standing bug where the logic in prepareEnqueueBufferWithLock made it possible to not flush metrics until a shard received enough traffic to reach flushSize. This caused odd side effects where metrics were emitted at very inconsistent intervals when the volume was low enough.
  • Distinguishes between the protobuf payload size limit and the data written to the network connection: we want to limit the former, but should write as much of the latter as possible. As a result, batching logic now lives in the Writer, not the Queue.
  • No more fixed-size buffers or idle goroutines wasting memory between writes.
  • Flushes to the network only when the client requests a flush - this prevents writing small payloads in the middle of a write cycle/between flushes (a minimal sketch of this behavior follows below).
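A minimal sketch of the flush-on-request idea described above; the type and method names here are hypothetical, not the actual m3aggregator code:

package tcpwrite

import (
	"bytes"
	"net"
)

// bufferedConnWriter is an illustrative sketch: Write only appends encoded
// payloads to an in-memory buffer, and the network connection is touched
// only when the caller explicitly asks for a flush.
type bufferedConnWriter struct {
	conn net.Conn
	buf  bytes.Buffer
}

// Write buffers the payload; it never writes to the connection directly.
func (w *bufferedConnWriter) Write(p []byte) (int, error) {
	return w.buf.Write(p)
}

// Flush writes everything buffered so far in a single network write and
// resets the buffer, so small payloads are not sent mid write cycle.
func (w *bufferedConnWriter) Flush() error {
	if w.buf.Len() == 0 {
		return nil
	}
	_, err := w.conn.Write(w.buf.Bytes())
	w.buf.Reset()
	return err
}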

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


@vdarulis vdarulis requested a review from mway March 9, 2021 06:33

codecov bot commented Mar 9, 2021

Codecov Report

Merging #3342 (b75e29c) into master (ffdce8e) will decrease coverage by 0.0%.
The diff coverage is 82.1%.


@@            Coverage Diff            @@
##           master    #3342     +/-   ##
=========================================
- Coverage    72.4%    72.4%   -0.1%     
=========================================
  Files        1098     1098             
  Lines      101927   101864     -63     
=========================================
- Hits        73897    73825     -72     
+ Misses      22955    22952      -3     
- Partials     5075     5087     +12     
Flag Coverage Δ
aggregator 76.7% <82.1%> (+<0.1%) ⬆️
cluster 84.9% <ø> (-0.2%) ⬇️
collector 84.3% <ø> (ø)
dbnode 78.9% <ø> (-0.1%) ⬇️
m3em 74.4% <ø> (ø)
m3ninx 73.6% <ø> (+<0.1%) ⬆️
metrics 19.8% <ø> (ø)
msg 74.5% <ø> (-0.1%) ⬇️
query 67.3% <ø> (ø)
x 80.4% <ø> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown.



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ffdce8e...2788fbb.

@@ -52,10 +52,10 @@ type Configuration struct {
ShardCutoverWarmupDuration *time.Duration `yaml:"shardCutoverWarmupDuration"`
ShardCutoffLingerDuration *time.Duration `yaml:"shardCutoffLingerDuration"`
Encoder EncoderConfiguration `yaml:"encoder"`
FlushSize int `yaml:"flushSize"`
Collaborator

Wouldn't that break parsing of existing yaml configs where this field is present? Maybe leave it with omitempty annotation?

Collaborator Author

kept the old one w/ deprecated comment
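For illustration, a hedged sketch of what keeping the deprecated field can look like (the field name and tag come from the diff above; the comment wording is illustrative):

// Configuration keeps the old field so existing YAML configs that set
// flushSize still parse; the value itself is no longer used.
type Configuration struct {
	// Deprecated: batching is now handled by the writer; this value is ignored.
	FlushSize int `yaml:"flushSize"`
}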

)

// Round up queue size to power of 2.
Collaborator

The code seems to be rounding down the number rather than rounding up :)

Collaborator

Given 3 ops in a row, it might be worth extracting this into a single-line function with a single "sanity" unit test.

Collaborator

This will round up as expected.

Collaborator Author

done, it def rounds up :)
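For reference, a sketch of the kind of single-line helper and sanity test suggested above (hypothetical names; not necessarily the exact shape it took in the PR):

package tcpwrite

import (
	"math/bits"
	"testing"
)

// nextPowerOfTwo rounds n up to the nearest power of two.
func nextPowerOfTwo(n int) int {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len(uint(n-1))
}

// The sanity test would live in a _test.go file.
func TestNextPowerOfTwo(t *testing.T) {
	for in, want := range map[int]int{0: 1, 1: 1, 3: 4, 4: 4, 5: 8, 1000: 1024} {
		if got := nextPowerOfTwo(in); got != want {
			t.Errorf("nextPowerOfTwo(%d) = %d, want %d", in, got, want)
		}
	}
}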

case <-q.doneCh:
return
}
if cap(*buf) < _queueMaxWriteBufSize {
Collaborator

Ultra-nit: use len instead of cap for consistency?

case <-q.doneCh:
return
}
if cap(*buf) < _queueMaxWriteBufSize {
Collaborator

Don't we want to reset the buffer upon successful write regardless of any checks?

Collaborator

Not immediately clear what the goal is here - we don't want to pool slices we've expanded past the max buffer size? Is this just to minimize unexpected memory growth and keep things bounded? Can you add a quick comment?

b := q.buf.shift()

bytes := b.Bytes()
if bytes == nil {
Collaborator

nit: can you do a len check here instead? not sure if it's possible to have a zero-length, non-nil slice returned.

Collaborator Author

yes, good idea, this relies too much on the protobuf Buffer's implementation - it doesn't have an easier way to test whether it's a zero value.
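A toy example (not the PR's code) showing why the length check is the safer test: len(nil) == 0 in Go, so a single check covers both a nil slice and a zero-length, non-nil one.

package main

import "fmt"

func main() {
	var nilSlice []byte
	emptySlice := make([]byte, 0, 16)

	// A nil check would miss the second case; a len check treats both as
	// "nothing to write".
	fmt.Println(nilSlice == nil, len(nilSlice) == 0)     // true true
	fmt.Println(emptySlice == nil, len(emptySlice) == 0) // false true
}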

q.writeAndReset()
lastDrain = time.Now()
// Check buffer capacity, not length, to make sure we're not pooling slices that are too large.
// Otherwise, it could in multi-megabyte slices hanging around, in case we get a spike in writes.
Collaborator

ultra-nit: "... could result in ...". Thanks for the comment btw
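A minimal sketch of the capacity-bounded pooling pattern discussed in this thread (the constant value and names are illustrative, not the actual m3aggregator code):

package tcpwrite

import "sync"

// _queueMaxWriteBufSize mirrors the constant referenced in the diff; the
// value here is illustrative.
const _queueMaxWriteBufSize = 1 << 16

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 4096)
		return &b
	},
}

// releaseBuf returns a buffer to the pool only if its capacity (not length)
// stayed within the bound, so a spike in writes can't leave multi-megabyte
// slices pinned in the pool; oversized buffers are left to the GC.
func releaseBuf(buf *[]byte) {
	if cap(*buf) < _queueMaxWriteBufSize {
		*buf = (*buf)[:0] // reset length, keep capacity
		bufPool.Put(buf)
	}
}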

@vdarulis vdarulis merged commit 079ac7d into master Mar 9, 2021
@vdarulis vdarulis deleted the v/tcpclient branch March 9, 2021 18:40
soundvibe added a commit that referenced this pull request Mar 10, 2021
* master:
  [dbnode] Remove unused shardBlockVolume (#3347)
  Fix new Go 1.15+ vet check failures (#3345)
  [coordinator] Add config option to make rollup rules untimed (#3343)
  [aggregator] Raw TCP Client write queueing/buffering refactor (#3342)
  [dbnode] Fail M3TSZ encoding on DeltaOfDelta overflow (#3329)