Add ActiveLogs() API to commitlog and use it in the CleanupManager #1090

richardartoul · 2018-10-15T21:43:45Z

This PR lets us distinguish between corrupt commitlogs and active commitlogs (which sometimes look corrupt because the header info hasn't been flushed yet), so that we can safely delete corrupt commitlog files.
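As a rough illustration of the idea in the description, the sketch below filters the files found on disk so that only corrupt files that are not currently active become deletion candidates. The `commitLogFile` type, its `Corrupt` field, and the `safeToDelete` helper are hypothetical names made up for this example; they are not the types or functions used in the PR.

```go
package main

import "fmt"

// commitLogFile is a hypothetical stand-in for a commit log file on disk.
type commitLogFile struct {
	Path    string
	Corrupt bool // true when the header info could not be read
}

// safeToDelete returns the paths of corrupt files that are not active.
// An active file may only look corrupt because its header is unflushed,
// so it is never returned.
func safeToDelete(onDisk []commitLogFile, active []string) []string {
	activeSet := make(map[string]struct{}, len(active))
	for _, p := range active {
		activeSet[p] = struct{}{}
	}

	var out []string
	for _, f := range onDisk {
		if !f.Corrupt {
			continue
		}
		if _, ok := activeSet[f.Path]; ok {
			continue // still being written; leave it alone
		}
		out = append(out, f.Path)
	}
	return out
}

func main() {
	onDisk := []commitLogFile{
		{Path: "commitlog-0", Corrupt: true},
		{Path: "commitlog-1", Corrupt: false},
		{Path: "commitlog-2", Corrupt: true},
	}
	// commitlog-2 is active, so only commitlog-0 is a deletion candidate.
	fmt.Println(safeToDelete(onDisk, []string{"commitlog-2"}))
}
```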

codecov · 2018-10-16T14:55:07Z

Codecov Report

Merging #1090 into master will decrease coverage by 1.7%.
The diff coverage is 82.5%.

@@           Coverage Diff            @@
##           master   #1090     +/-   ##
========================================
- Coverage    71.4%   69.6%   -1.8%     
========================================
  Files         726     715     -11     
  Lines       60571   60412    -159     
========================================
- Hits        43261   42098   -1163     
- Misses      14556   15641   +1085     
+ Partials     2754    2673     -81

Flag	Coverage Δ
#aggregator	`81.6% <ø> (ø)`	⬆️
#cluster	`84.8% <ø> (-1.3%)`	⬇️
#collector	`78.1% <ø> (ø)`	⬆️
#dbnode	`77.2% <82.5%> (-4%)`	⬇️
#m3em	`73.2% <ø> (ø)`	⬆️
#m3ninx	`71.2% <ø> (-4.2%)`	⬇️
#m3nsch	`51.1% <ø> (ø)`	⬆️
#metrics	`18.3% <ø> (ø)`	⬆️
#msg	`75.1% <ø> (-0.2%)`	⬇️
#query	`65.1% <ø> (+1.5%)`	⬆️
#x	`69.4% <ø> (-5.8%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6c086cf...3af9c80.

justinjc · 2018-10-17T19:22:40Z

src/dbnode/persist/fs/commitlog/commit_log.go

@@ -184,6 +189,21 @@ func (l *commitLog) Open() error {
 	return nil
 }

+func (l *commitLog) ActiveLogs() ([]File, error) {


Why have this function return []File? It seems like you return at most one File.

@prateek was talking about having support for multiple commit log files at some point.

prateek · 2018-10-19T16:39:11Z

docs/m3db/architecture/engine.md

@@ -181,7 +181,7 @@ The ticking process runs continously in the background and is responsible for a

 #### Merging all encoders

-M3TSZ is designed for compressing time series data in which each datapoint has a timestamp that is larger than the last encoded datapoint. For monitoring workloads this works very well because every subsequent datapoint is almost always larger than the previous one. However, real world systems are messy and occassionally out of order writes will be received. When this happens, M3DB will allocate a new encoder for the out of order datapoints. The multiple encoders need to be merged before flushing the data to disk, but to prevent huge memory spikes during the flushing process we continuously merge out of order encoders in the background.
+M3TSZ is designed for compressing time series data in which each datapoint has a timestamp that is larger than the last encoded datapoint. For monitoring workloads this works very well because every subsequent datapoint is almost always larger than the previous one. However, real world systems are messy and occasionally out of order writes will be received. When this happens, M3DB will allocate a new encoder for the out of order datapoints. The multiple encoders need to be merged before flushing the data to disk, but to prevent huge memory spikes during the flushing process we continuously merge out of order encoders in the background.


nit: do you want to say "almost always chronologically after the previous one" instead of "almost always larger than the previous one"?

ps can't tell what changed in this line
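For readers unfamiliar with the behaviour the doc paragraph describes, here is a toy model of it. It assumes one append-only slice per "encoder" and merges by sorting; the real M3TSZ encoders compress data and are merged differently, so the `series`, `write`, and `merge` names below are purely illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

type datapoint struct {
	ts    int64
	value float64
}

// series is a hypothetical model: each encoder only accepts datapoints
// whose timestamp is after the last one it holds.
type series struct {
	encoders [][]datapoint
}

// write appends to the first encoder that can take the datapoint in order,
// and allocates a new encoder for an out of order write otherwise.
func (s *series) write(dp datapoint) {
	for i, enc := range s.encoders {
		if dp.ts > enc[len(enc)-1].ts {
			s.encoders[i] = append(enc, dp)
			return
		}
	}
	s.encoders = append(s.encoders, []datapoint{dp})
}

// merge collapses all encoders into a single time-ordered one.
func (s *series) merge() {
	if len(s.encoders) <= 1 {
		return
	}
	var all []datapoint
	for _, enc := range s.encoders {
		all = append(all, enc...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].ts < all[j].ts })
	s.encoders = [][]datapoint{all}
}

func main() {
	var s series
	for _, ts := range []int64{1, 2, 5, 3, 6, 4} {
		s.write(datapoint{ts: ts, value: float64(ts)})
	}
	fmt.Println("encoders before merge:", len(s.encoders))
	s.merge()
	fmt.Println("encoders after merge:", len(s.encoders))
}
```

Merging in the background, as the doc says, keeps the number of encoders small so that the flush does not have to do all of this work at once.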

prateek · 2018-10-19T16:40:44Z

src/dbnode/integration/disk_cleanup_test.go

@@ -89,6 +88,15 @@ func TestDiskCleanup(t *testing.T) {
 	// and commit logs at now will be deleted
 	newNow := now.Add(retentionPeriod).Add(2 * blockSize)
 	testSetup.setNowFn(newNow)
+	// This isn't great, but right now the commitlog will only ever rotate when writes


nice to see our tests actually catch this kinda thing

prateek · 2018-10-19T16:44:46Z

src/dbnode/storage/cleanup.go


+	// We list the commit log files on disk before we determine what the currently active commitlog


+1 for explanation and example

prateek · 2018-10-19T16:55:08Z

src/dbnode/persist/fs/commitlog/commit_log.go

 	for write := range l.writes {
 		// For writes requiring acks add to pending acks
 		if write.completionFn != nil {
-			l.pendingFlushFns = append(l.pendingFlushFns, write.completionFn)
+			l.flushState.pendingFlushFns = append(l.flushState.pendingFlushFns, write.completionFn)


Isn't this racy? You're modifying a field on flushState without holding a write lock on it, but you use a read lock when accessing the same field on line 326.
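A minimal sketch of the locking discipline this comment asks for, assuming `flushState` embeds a `sync.RWMutex` (which the diff does not show): appends take the write lock, and readers take the read lock. The `addPending` and `numPending` helpers are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

type completionFn func(err error)

// flushState is modelled here as a mutex plus the slice it guards.
type flushState struct {
	sync.RWMutex
	pendingFlushFns []completionFn
}

// addPending mutates the slice, so it must hold the write lock.
func (s *flushState) addPending(fn completionFn) {
	s.Lock()
	s.pendingFlushFns = append(s.pendingFlushFns, fn)
	s.Unlock()
}

// numPending only reads, so the read lock is enough.
func (s *flushState) numPending() int {
	s.RLock()
	defer s.RUnlock()
	return len(s.pendingFlushFns)
}

func main() {
	var (
		state flushState
		wg    sync.WaitGroup
	)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			state.addPending(func(error) {})
			_ = state.numPending()
		}()
	}
	wg.Wait()
	fmt.Println(state.numPending())
}
```

Running a variant of this without the `Lock()` in `addPending` under `go test -race` or `go run -race` should report the data race.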

prateek · 2018-10-19T17:08:00Z

src/dbnode/persist/fs/commitlog/commit_log.go

-	metrics commitLogMetrics
+type closedState struct {
+	sync.RWMutex
+	closed bool


Could you replace the closedState with an atomic? It'll make a lot of the code simpler.

We caught up; keeping the lock will keep the code simpler, as discussed.

prateek · 2018-10-19T17:17:31Z

src/dbnode/persist/fs/commitlog/commit_log.go

@@ -61,30 +61,45 @@ type commitLogFailFn func(err error)
 type completionFn func(err error)

 type commitLog struct {
-	sync.RWMutex
+	flushState  flushState


please add unit test for some of the races

robskillington · 2018-10-19T21:18:25Z

src/dbnode/persist/fs/commitlog/commit_log.go

+	closeErr chan error
+
+	// TODO(r): replace buffered channel with concurrent striped
+	// circular buffer to avoid central write lock contention.


Hm, you can probably remove this comment; I think we just need to encode in parallel rather than build a better buffer.

robskillington · 2018-10-22T18:41:39Z

src/dbnode/persist/fs/commitlog/commit_log.go

@@ -184,6 +228,34 @@ func (l *commitLog) Open() error {
 	return nil
 }

+func (l *commitLog) ActiveLogs() ([]File, error) {
+	l.closedState.Lock()
+	defer l.closedState.Unlock()


nit: This just needs to be RLock() and RUnlock() yeah? Don't see it changing the closed state as far as I can tell?
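To make the nit concrete, here is a small sketch in which the read path only takes `RLock()` because it never changes `closed`, while `Close()` takes the full lock. `demoLog`, `activePath`, and `errClosed` are hypothetical names for this example, and the real `ActiveLogs()` does more work than this.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("commit log is closed")

type closedState struct {
	sync.RWMutex
	closed bool
}

// demoLog is a hypothetical, stripped-down commit log.
type demoLog struct {
	closedState closedState
	activePath  string
}

// ActiveLogs only reads closedState.closed, so a read lock suffices and
// concurrent callers do not block each other.
func (l *demoLog) ActiveLogs() ([]string, error) {
	l.closedState.RLock()
	defer l.closedState.RUnlock()

	if l.closedState.closed {
		return nil, errClosed
	}
	return []string{l.activePath}, nil
}

// Close mutates closedState.closed, so it needs the write lock.
func (l *demoLog) Close() {
	l.closedState.Lock()
	l.closedState.closed = true
	l.closedState.Unlock()
}

func main() {
	l := &demoLog{activePath: "commitlog-0"}
	files, err := l.ActiveLogs()
	fmt.Println(files, err)

	l.Close()
	files, err = l.ActiveLogs()
	fmt.Println(files, err)
}
```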

robskillington · 2018-10-22T18:42:41Z

src/dbnode/persist/fs/commitlog/commit_log.go

+	}
+
+	wg.Wait()
+	return []File{file}, err


Do you want to bother doing `if err != nil { return nil, err }; return []File{file}, nil`, so that you return something if and only if `err == nil`? It's just slightly more idiomatic Go to do so.

Yeah, that's fair. It's kind of messed up to return a non-empty slice in the error case.
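The suggested shape, written out as a self-contained sketch. `file`, `lookupActive`, and `activeLogs` are hypothetical stand-ins; the point is only that the slice is built after the error check, so callers never see a non-empty result alongside a non-nil error.

```go
package main

import (
	"errors"
	"fmt"
)

type file struct{ path string }

// lookupActive is a hypothetical helper that may fail.
func lookupActive(fail bool) (file, error) {
	if fail {
		return file{}, errors.New("lookup failed")
	}
	return file{path: "commitlog-0"}, nil
}

// activeLogs returns a non-nil slice if and only if err == nil.
func activeLogs(fail bool) ([]file, error) {
	f, err := lookupActive(fail)
	if err != nil {
		return nil, err
	}
	return []file{f}, nil
}

func main() {
	files, err := activeLogs(false)
	fmt.Println(files, err)

	files, err = activeLogs(true)
	fmt.Println(files, err)
}
```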

robskillington · 2018-10-22T18:57:34Z

src/dbnode/persist/fs/commitlog/commit_log.go

@@ -58,33 +58,72 @@ type writeCommitLogFn func(
 ) error
 type commitLogFailFn func(err error)

-type completionFn func(err error)
+type valueTypeFn func(f File, err error)


For better reuse, which we'll need in the future, can we just rename this to callbackFn?

And maybe it would be better for the first arg to be some composite event struct, so we can add to it cleanly in the future:

type callbackFn func(result callbackResult)

type callbackResultType uint

const (
	activeLogsCallback callbackResultType = iota
	// ... more in the future
)

type callbackResult struct {
	resultType callbackResultType
	err        error
	activeLogs activeLogsCallbackResult // to be used if and only if resultType == activeLogsCallback
}

type activeLogsCallbackResult struct {
	file *File
}

func (r callbackResult) activeLogsCallbackResult() (activeLogsCallbackResult, error) {
	if r.resultType != activeLogsCallback {
		return activeLogsCallbackResult{}, fmt.Errorf(
			"wrong result type: expected=%d, actual=%d",
			activeLogsCallback, r.resultType)
	}
	if r.err != nil {
		return activeLogsCallbackResult{}, r.err
	}
	return r.activeLogs, nil
}

// Now from calling code:
func foo() (*File, error) {
	var (
		wg     sync.WaitGroup
		result activeLogsCallbackResult
		err    error
	)
	wg.Add(1)
	writes <- commitLogWrite{
		valueType: activeLogsValueType,
		callbackFn: func(r callbackResult) {
			result, err = r.activeLogsCallbackResult()
			wg.Done()
		},
	}
	wg.Wait()
	if err != nil {
		return nil, err
	}
	return result.file, nil
}

Yeah, I can make this change. The reason I didn't is that the Rotate() API will need pretty much the same result as the ActiveLogs one, but this is fine too.

resultType seems a little overkill, I think I can just repurpose the valueType for now to infer the result type.

robskillington · 2018-10-22T19:01:09Z

src/dbnode/persist/fs/commitlog/writer.go

-	// Flush will flush the contents to the disk, useful when first testing if first commit log is writable
+	// Sync will ensure that all writes that have been issued to the writer have been
+	// FSync'd to disk.
+	Sync() error


Perhaps FlushAndSync() considering that's what it's doing?

I want to stop using the word Flush entirely in the public interface, because it's a little nonsensical: in this case "flush" means "flushed to chunk writer", whereas all an external caller will care about is "FSync'd to disk".

Getting rid of the Flush method is outside the scope of this PR, but I'd rather not put the word in this method name.
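The distinction being drawn here has a close analogue in the Go standard library, which may help readers: `bufio.Writer.Flush` only pushes buffered bytes down to the underlying writer, while `os.File.Sync` asks the OS to commit the file to stable storage. The sketch below uses those standard calls only; `writeDurably` and `roundTrip` are hypothetical helpers, and treating the commitlog's chunk writer as comparable to a `bufio.Writer` is an assumption made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// writeDurably writes data through a buffer, flushes the buffer to the
// file, and then fsyncs the file.
func writeDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	if _, err := w.Write(data); err != nil {
		return err
	}
	// Flush: buffered bytes reach the file, but may still sit in OS caches.
	if err := w.Flush(); err != nil {
		return err
	}
	// Sync: ask the OS to commit the file contents to stable storage.
	return f.Sync()
}

// roundTrip writes data to a temporary file and reads it back.
func roundTrip(data []byte) (string, error) {
	dir, err := os.MkdirTemp("", "commitlog-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "segment")
	if err := writeDurably(path, data); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(got), nil
}

func main() {
	got, err := roundTrip([]byte("hello"))
	fmt.Println(got, err)
}
```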

robskillington

LGTM once build passing

richardartoul changed the title ~~[WIP] - Add ActiveLogs() API to commitlog~~ Add ActiveLogs() API to commitlog Oct 16, 2018

richardartoul changed the title ~~Add ActiveLogs() API to commitlog~~ [WIP] - Add ActiveLogs() API to commitlog Oct 16, 2018

richardartoul changed the title ~~[WIP] - Add ActiveLogs() API to commitlog~~ Add ActiveLogs() API to commitlog and expose it to the CleanupManager Oct 16, 2018

richardartoul mentioned this pull request Oct 16, 2018

Add ability to distinguish between active commit log file and all other commit log files in CleanupManager #1078

Closed

richardartoul force-pushed the ra/active-log branch 3 times, most recently from 67379f7 to 920e57d Compare October 16, 2018 22:09

richardartoul changed the title ~~Add ActiveLogs() API to commitlog and expose it to the CleanupManager~~ Add ActiveLogs() API to commitlog and use it in the CleanupManager Oct 16, 2018

richardartoul requested review from prateek, robskillington and justinjc October 16, 2018 22:17

richardartoul force-pushed the ra/active-log branch from 920e57d to 227de8b Compare October 17, 2018 14:57

justinjc reviewed Oct 17, 2018

prateek reviewed Oct 19, 2018

robskillington reviewed Oct 19, 2018

richardartoul force-pushed the ra/active-log branch from 25a530f to 4d409bf Compare October 22, 2018 16:41

robskillington reviewed Oct 22, 2018

robskillington approved these changes Oct 22, 2018

Richard Artoul added 3 commits October 23, 2018 10:48

Add ActiveLogs() API to commitlog

0609e34

regen mocks

79a4280

Remove unused error

67b2fbd

Richard Artoul added 23 commits October 23, 2018 10:48

Fix flaky test

58656ab

Refactor locking to be more granular and organized

1434251

Add period to comment

58a7ccf

use defer for unlock

2f54ec8

Add comment about ordering of function calls

0f62824

Fix typos

4334c27

Add sync API

471cd6f

move pendingFlushesFn out of substruct

de40a0e

More refactoring

6f156d6

restore flushState

98f1034

Add comment and test

72f03e7

make prop test big

4b31765

improve comment

a041794

Fix docs

696e877

Dont use so many locks

9bdeb29

Fix comment

1ead851

remove comment

81e66ad

Call wg.Done() ever for errors

db7f1c3

remove lock

4071b0e

refactor comment

e6e8f19

Addresss feedback

d5d1543

reorder ifs

e65f16e

Remove sync API

3d65fce

richardartoul force-pushed the ra/active-log branch from 07f2180 to 3d65fce Compare October 23, 2018 14:48

Richard Artoul added 4 commits October 23, 2018 12:02

skip conc test

967dde0

mark conc test as big

aba86c0

Fix import order

5102e4a

Fix broken test

3af9c80

richardartoul merged commit a6b8c2a into master Oct 23, 2018

justinjc deleted the ra/active-log branch October 30, 2018 20:48
