
Handle commit log files with corrupt info headers during cleanup and bootstrap #1066

Merged: 47 commits from ra/fix-commitlog-cleanup into master on Oct 16, 2018

Conversation

richardartoul (Contributor) commented Oct 10, 2018

  • Automatically delete commit log files with corrupt info headers during cleanup instead of erroring out forever and requiring manual intervention (a rough sketch of this follows below the list).
  • Add static configuration for whether corruption encountered during commitlog bootstrapping should cause all requested ranges to be returned as unfulfilled (default: return unfulfilled).
  • Always read all the commit log files we intended to read, regardless of that config.
  • Auto-detect in the commitlog bootstrapper whether it is possible for the peers bootstrapper to satisfy the requests. If not, return fulfilled instead of unfulfilled, because the peers bootstrapper can't perform a repair anyway.
  • Don't return errors from the commitlog bootstrapper for corrupt files.
  • Emit logs for all encountered corrupt files.
  • Emit metrics for all encountered corrupt files.
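As a rough illustration of the cleanup-side change (the package, helper name, and logging call here are assumptions for the sketch, not the exact diff; ErrorWithPath is the typed error introduced in the review threads below):

package cleanup

import (
	"log"
	"os"

	"github.com/m3db/m3/src/dbnode/persist/fs/commitlog"
)

// deleteCorruptCommitLogFiles is a hypothetical sketch: commit log files whose
// info headers could not be read are logged and removed during cleanup instead
// of failing the cleanup run forever and requiring manual intervention.
func deleteCorruptCommitLogFiles(corrupt []commitlog.ErrorWithPath) error {
	var lastErr error
	for _, c := range corrupt {
		// Keep operator visibility: log the corruption before deleting the file.
		log.Printf("deleting corrupt commit log file: path=%s, err=%v", c.Path(), c.Error())
		if err := os.Remove(c.Path()); err != nil {
			lastErr = err
		}
	}
	return lastErr
}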

@richardartoul richardartoul changed the title Handle commit log files with corrupt info headers during cleanup Handle commit log files with corrupt info headers during cleanup and bootstrap Oct 10, 2018
codecov bot commented Oct 10, 2018

Codecov Report

Merging #1066 into master will increase coverage by 2.03%.
The diff coverage is 82.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1066      +/-   ##
==========================================
+ Coverage    75.3%   77.34%   +2.03%     
==========================================
  Files         569      578       +9     
  Lines       48213    48590     +377     
==========================================
+ Hits        36307    37580    +1273     
+ Misses       9640     8642     -998     
- Partials     2266     2368     +102
Flag         Coverage          Δ
#aggregator  81.59% <ø>        (ø) ⬆️
#collector   59.23% <ø>        (ø) ⬆️
#dbnode      81.39% <82.44%>   (+3.89%) ⬆️
#m3em        73.21% <ø>        (ø) ⬆️
#m3ninx      75.25% <ø>        (+4.02%) ⬆️
#m3nsch      51.19% <ø>        (ø) ⬆️
#msg         74.98% <ø>        (ø) ⬆️
#query       63.67% <ø>        (-1.65%) ⬇️
#x           75.1% <ø>         (+5.74%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2eb71f7...b8020fe.

// File represents a commit log file and its associated metadata.
type File struct {
	FilePath string
	Start    time.Time
	Duration time.Duration
	Index    int64
	// Contains any errors encountered when trying to read the commitlogs file info. We
Collaborator:

super nit: mind making a union type instead of this just to be clear to users. something like:

type File struct {
  FilePath string
  Start time.Time
  Duration time.Duration
  Index int64
}

type FileOrError struct {
  Path string
  File File
  Error error
}

Contributor Author (richardartoul):

Sure, I could also do something like:

type FileOrError struct {
    file File
    err  error
}

func (f *FileOrError) File() (File, error) {
    return f.file, f.err
}

And then it's basically impossible to misuse it.

Collaborator:

I don't like that people could use the Path regardless of error in that scheme (e.g. the cleanup manager still has to delete the file regardless of err). In addition to what you suggested, could you expose a typed error with the path available to get around that?

Contributor Author (richardartoul):

Just to clarify, are you saying you don't like the idea of someone receiving an error but then still needing to access the File object to get at the path, so we should export an error struct that includes the file path, correct? Yeah, that seems better to me too.

Collaborator:

Yep, exactly.
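What this thread converged on is an exported typed error that carries the path. A minimal sketch of the shape (the unexported field names are assumptions and "fmt" is assumed imported; the newErrorWithPath constructor and Path() accessor do appear in the diff further down):

// ErrorWithPath is an error encountered while reading a commit log file's info
// header, together with the path of the file it was encountered on, so callers
// (e.g. the cleanup manager) can act on the file without a valid File value.
type ErrorWithPath struct {
	err  error
	path string
}

// Error implements the error interface.
func (e ErrorWithPath) Error() string {
	return fmt.Sprintf("%s: %v", e.path, e.err)
}

// Path returns the path of the commit log file the error was encountered on.
func (e ErrorWithPath) Path() string {
	return e.path
}

func newErrorWithPath(err error, path string) ErrorWithPath {
	return ErrorWithPath{err: err, path: path}
}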

@@ -782,6 +782,14 @@ func (s *commitLogSource) newReadCommitLogPred(
// that has a snapshot more recent than the global minimum. If we use an array for fast-access this could
// be a small win in terms of memory utilization.
return func(f commitlog.File) bool {
if f.Error != nil {
prateek (Collaborator) commented Oct 11, 2018:

We can't skip corrupt commit log files as if they didn't exist without risking data loss. It'd be much safer to see if any corrupt commit logs exist and, if so, fall back to the peers bootstrapper (in the case where there are peers); if there are no peers, allow users to opt in to skipping corrupt commit log files, but don't do it by default.

Contributor Author (richardartoul):

I'm ok with making this opt-outable for users who really care about data integrity, but I think by default it should skip them. If you can't even read the info part of the commitlog, then (barring true corruption of the filesystem unrelated to M3DB shutdown) your commitlog isn't missing any data other than the standard amount you would be missing anyway due to the write-behind strategy. I don't think this is a situation where we need to involve operators in the general case.

Contributor Author (richardartoul):

By putting the commitlog bootstrapper before the peers bootstrapper you're basically accepting losing the last few writes that occurred right before shutdown due to the write-behind strategy (as well as any writes that were issued while you were down) anyway.

In my mind we should emit a log, skip over bad commitlog files, and then eventually address this issue with the corruption APIs / a background repair process.

Contributor Author (richardartoul):

The other reason I think it sucks to push this burden down to the operators is that by the time this happens and you realize what's going on, there are only two options:

  1. Disable the commitlog bootstrapper and replace it with the peers bootstrapper, which is fine, but only viable in non-catastrophic scenarios where all the other nodes are up.

  2. Go and delete the bad commitlog file and then restart with the commitlog bootstrapper, at which point you've probably written out enough commit log files that what should have been 6-8 minutes of downtime turns into an hour-long ordeal.

Collaborator:

For the peers case: how about just returning all requested shard ranges as unfulfilled from the commit log bootstrapper (won't need any interface changes from the way things are now) when the commit log runs into an error; that way we'll fall back to peers automatically.

For the single node case: we can have a new RunOption to control whether to skip or not, and the bootstrap process can be configured from the top level to specify what to pass it. That way the single node config we put out can default to coming up, but we can put a WARNING next to it and in the docs.

Contributor Author (richardartoul):

Spoke offline and settled on:

1. Add static configuration for whether corruption encountered during commitlog bootstrapping should cause all requested ranges to be returned as unfulfilled (default to returning unfulfilled).
2. Always read all the commit log files we intended to read, regardless of that config.
3. Add an exception in the commitlog bootstrapper for single node deployments: detect that this is the only node in the placement and default to returning fulfilled, because there is nothing else you could possibly do to recover the data.
4. Don't return errors from the commitlog bootstrapper for corrupt files.
5. Emit metrics / logs for all encountered corrupt files.

This gives us a good mix between maintaining data integrity as much as we can and not imposing undue operational burden.

In a single node deployment, the issue will just resolve itself.
In a single node failure within a multi-node cluster, the issue will just resolve itself.
In a multi-node failure within a multi-node cluster, some nodes with corrupt commitlog files may get stuck in the peer bootstrapping phase, but they can be "released" with a KV change to the peer bootstrapper consistency level and come online without requiring an additional costly restart.
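For illustration, a rough sketch of the decision this plan implies in the commitlog bootstrapper (the option accessor and method signatures are illustrative, not necessarily the exact ones in the diff; couldObtainDataFromPeers does show up further down):

// Sketch: decide whether ranges touched by corrupt commit log files should be
// reported as unfulfilled so the peers bootstrapper gets a chance to fill them.
func (s *commitLogSource) shouldReturnUnfulfilled(
	encounteredCorruptData bool,
	shardsTimeRanges result.ShardTimeRanges,
	runOpts bootstrap.RunOptions,
) (bool, error) {
	if !encounteredCorruptData {
		// Nothing corrupt was read; everything we attempted is fulfilled.
		return false, nil
	}
	if !s.opts.ReturnUnfulfilledForCorruptCommitLogFiles() {
		// Operator opted out of falling back to peers (item 1 above).
		return false, nil
	}
	// Item 3 above: only return unfulfilled if the peers bootstrapper could
	// actually repair the data; otherwise it just blocks the bootstrap.
	couldObtainDataFromPeers, err := s.couldObtainDataFromPeers(shardsTimeRanges, runOpts)
	if err != nil {
		return false, err
	}
	return couldObtainDataFromPeers, nil
}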

@richardartoul richardartoul changed the title Handle commit log files with corrupt info headers during cleanup and bootstrap [WIP] - Handle commit log files with corrupt info headers during cleanup and bootstrap Oct 11, 2018
@richardartoul richardartoul changed the title [WIP] - Handle commit log files with corrupt info headers during cleanup and bootstrap Handle commit log files with corrupt info headers during cleanup and bootstrap Oct 12, 2018
@@ -30,6 +30,61 @@ import (
"github.com/m3db/m3/src/dbnode/persist/fs/msgpack"
)

type fsError struct {
Collaborator:

nit: move these error types to the bottom of the file

@@ -38,6 +93,13 @@ type File struct {
Index int64
}

func newErrorWithPath(err error, path string) ErrorWithPath {
Collaborator:

same for this ctor, move it next to the associated type at the bottom of the file

}

sort.Slice(commitLogFiles, func(i, j int) bool {
return commitLogFiles[i].Start.Before(commitLogFiles[j].Start)
// Sorting is best effort here since we may not know the start.
prateek (Collaborator) commented Oct 13, 2018:

Why sort at all? And if you're going to sort, you can sort errors first (or last) and then guarantee order too.

Or better yet, change the return type of this function to ([]File, []ErrorWithPath, error), and then we won't need the FileOrError type.

Contributor Author (richardartoul):

Sorting is important because you want to try to read the commit log files in order in the bootstrapper.

Contributor Author (richardartoul):

Your interface change suggestion is good though, just made it
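Roughly, the split return type suggested above would look like this (readCommitLogFiles is a hypothetical helper and "sort" is assumed imported; Options, File, and ErrorWithPath are the types from the diff):

// Files returns the commit log files whose info headers could be read, sorted
// best-effort by start time, plus an ErrorWithPath per file whose info header
// could not be read, so corrupt files are never silently dropped.
func Files(opts Options) ([]File, []ErrorWithPath, error) {
	files, corruptFiles, err := readCommitLogFiles(opts) // hypothetical helper
	if err != nil {
		return nil, nil, err
	}
	// The bootstrapper wants to replay commit log files in order; corrupt files
	// have no readable start time, so they are returned separately instead of
	// being interleaved with a best-effort sort.
	sort.Slice(files, func(i, j int) bool {
		return files[i].Start.Before(files[j].Start)
	})
	return files, corruptFiles, nil
}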

@@ -38,27 +38,30 @@ import (
)

func TestFiles(t *testing.T) {
// TODO(r): Find some time/people to help investigate this flakey test.
Collaborator:

lol is this no longer flaky?

Contributor Author (richardartoul):

yeah I fixed it

Collaborator:

👍

opts = opts.SetFilesystemOptions(
opts.FilesystemOptions().
SetFilePathPrefix(dir),
)
files, err := Files(opts)
require.NoError(t, err)
require.Equal(t, 5, len(files))
require.True(t, len(files) >= minNumBlocks)
Collaborator:

ah is this the fix?

Contributor Author (richardartoul):

yep

func filterFiles(opts Options, files []File, predicate FileFilterPredicate) []File {
filteredFiles := make([]File, 0, len(files))
for _, f := range files {
func filterFiles(opts Options, files []FileOrError, predicate FileFilterPredicate) ([]File, []ErrorWithPath) {
Collaborator:

as above, can move this to be used within the Files() method.

}
}

type commitLogSourceMetrics struct {
Collaborator:

huge +1 to this.

@@ -275,6 +323,7 @@ func (s *commitLogSource) ReadData(
}

// Read / M3TSZ encode all the datapoints in the commit log that we need to read.
s.metrics.data.readingCommitlogs.Update(1)
Collaborator:

Could you do the same thing we normally do: use an atomic and have a background routine emitting the gauge regularly? That way you'd get more than a single datapoint indicating what's going on.
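The pattern being suggested, sketched out (the one-second emission interval is an assumption and "time"/tally are assumed imported; the gaugeLoop signature matches the later diff, the body here is illustrative):

// gaugeLoop emits g = 1 on an interval until the returned function is called,
// at which point it emits 0 and stops. This gives a continuous "currently
// reading commit logs" signal instead of a single datapoint.
func (m commitLogSourceMetrics) gaugeLoop(g tally.Gauge) func() {
	doneCh := make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				g.Update(1)
			case <-doneCh:
				g.Update(0)
				return
			}
		}
	}()
	return func() { close(doneCh) }
}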

@@ -682,7 +751,7 @@ func (s *commitLogSource) newReadCommitLogPredBasedOnAvailableSnapshotFiles(
shardsTimeRanges result.ShardTimeRanges,
snapshotFilesByShard map[uint32]fs.FileSetFilesSlice,
) (
func(f commitlog.File) bool,
commitlog.FileFilterPredicate,
prateek (Collaborator) commented Oct 13, 2018:

+1 for type safety

@@ -1355,6 +1433,17 @@ func (s *commitLogSource) ReadIndex(
indexResults, indexOptions, indexBlockSize, resultOptions)
}

if iterErr := iter.Err(); iterErr != nil {
// Log the error and mark that we encountered corrupt data, but don't
Collaborator:

lol, what I wouldn't give for macros.

return false, nil
}

couldObtainDataFromPeers, err := s.couldObtainDataFromPeers(
Collaborator:

Hm, I thought you were going to check if it's a single node for this case. Why use the more extended peer condition?

Contributor Author (richardartoul):

Seemed like it was slightly better to not hand off to the peers bootstrapper if it can't help anyway. Think I should remove it?
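For illustration, the essence of the check being discussed (the parameters are made up for the sketch; the real couldObtainDataFromPeers consults the placement/topology state):

// Hypothetical sketch: the peers bootstrapper can only help if this isn't a
// single-node placement and at least one peer replica is actually reachable.
func couldObtainDataFromPeers(replicas int, availablePeers int) bool {
	if replicas <= 1 {
		// Single-node placement: no peer could possibly hold the missing data,
		// so marking ranges unfulfilled would just block the bootstrap.
		return false
	}
	// With replication, we still need at least one healthy peer to stream from.
	return availablePeers > 0
}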

"encountered err: %v reading commit log file: %v info during cleanup, marking file for deletion",
errorWithPath.Error(), errorWithPath.Path())

// TODO(rartoul): Leave this out until we have a way of distinguishing between a corrupt commit
Collaborator:

+1


func newCommitLogSourceDataAndIndexMetrics(scope tally.Scope) commitLogSourceDataAndIndexMetrics {
return commitLogSourceDataAndIndexMetrics{
data: newCommitLogSourceMetrics(scope),
index: newCommitLogSourceMetrics(scope),
Collaborator:

Can you add a different tag for each one? Otherwise they'll overwrite each other's metrics.

e.g.

data:  newCommitLogSourceMetrics(scope.Tagged(map[string]string{"source_type": "data"})),
index:  newCommitLogSourceMetrics(scope.Tagged(map[string]string{"source_type": "index"})),

Contributor Author (richardartoul):

oh duh, thanks

return m.gaugeLoop(m.mergingSnapshotsAndCommitlogs)
}

func (m commitLogSourceMetrics) gaugeLoop(g tally.Gauge) func() {
Collaborator:

Can you type-ify this function so it's a little cleaner, e.g.:

type closerFn func()

func (m commitLogSourceMetrics) emitReadingSnapshots() closerFn {
	return m.gaugeLoop(m.readingSnapshots)
}

@@ -300,9 +379,17 @@ func (s *commitLogSource) ReadData(
blockStart: dp.Timestamp.Truncate(blockSize),
}
}
doneReadingCommitlogs()
Collaborator:

Is it a waste to just emit the gauges the whole time this function is running? Then we could do it all at the start and defer all these cleanups. It would be better mainly because it would defend against an early return not stopping the loop, which would cause a memory leak (the loop would continue to hold a ref to the commit log source forever).

Contributor Author (richardartoul):

see comment below

@@ -1319,11 +1421,13 @@ func (s *commitLogSource) ReadIndex(
)

// Start by reading any available snapshot files.
doneReadingSnapshots := s.metrics.index.emitReadingSnapshots()
Collaborator:

If we could do this with a defer somehow that would also be nicer. Is there some state it could just read for the whole execution of the function and emit one or zero from, instead of needing to explicitly emit based on control flow?

Contributor Author (richardartoul):

Yeah, there isn't any obvious state we could read to make a decision (any state we could read would have to be added based on control flow).

I think we either leave it as is, or we can simplify it by not distinguishing between reading commitlogs and snapshots and just having a gauge for "commitlog bootstrapping data" and "commitlog bootstrapping index".

Contributor Author (richardartoul):

Honestly, making the distinction between commitlogs and snapshots is not worth it: once I land the perf improvement, reading snapshots will take something like 20-30 seconds, so it's not worth separating the two. I'll make the change.
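The simplified shape this converged on, roughly (a sketch, not the exact diff): one gauge per source (data/index) emitted for the whole duration of the read, started up front so every return path stops it.

func (s *commitLogSource) ReadData( /* args elided for the sketch */ ) (result.DataBootstrapResult, error) {
	// Emit the "bootstrapping data" gauge for the entire duration of ReadData;
	// defer guards against an early return leaving the gauge loop running.
	doneBootstrapping := s.metrics.data.emitBootstrapping()
	defer doneBootstrapping()

	// ... read snapshots, read commit logs, merge ...
	return nil, nil // placeholder for the sketch
}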

}

if shouldReturnUnfulfilled {
indexResult.SetUnfulfilled(shardsTimeRanges)
Collaborator:

We still return our data right? (I know this is likely true, just want to verify). That way we can join whatever we had at least.

Contributor Author (richardartoul):

Yeah, this should have no effect on what data we return (pretty sure my prop test still verifies the data is there), so as long as the caller doesn't ignore the data it will get merged in.


func TestCleanupManagerDeletesCorruptCommitLogFiles(t *testing.T) {
// TODO(rartoul): Re-enable this once https://github.com/m3db/m3/issues/1078
// is resolved.
Collaborator:

Can you add a note to the issue to "be sure to re-enable test TestCleanupManagerDeletesCorruptCommitLogFiles with the same PR that fixes this issue".

Contributor Author (richardartoul):

sure

fsOpts = s.opts.CommitLogOptions().FilesystemOptions()
filePathPrefix = fsOpts.FilePathPrefix()
// Emit bootstrapping gauge for duration of ReadData
doneBootstrapping = s.metrics.data.emitBootstrapping
Collaborator:

Hm, did you mean this?:

		doneBootstrapping      = s.metrics.data.emitBootstrapping()

Collaborator:

Re: why this didn't break: as a meta point, if we wanted to be more "type safe" about it we could make it return a Closer that calls the method.

import "github.com/m3db/m3x/close"

var _ close.Closer = closerFn(nil)

// Implement close.Closer
func (f closerFn) Close() {
	f()
}

// Now to call it

func myThing() close.Closer {
	return closerFn(func() { /* code */})
}

Contributor Author (richardartoul):

Fixed the function call. I get what you're saying, but it seems like it really hurts readability for something that would have been caught by a test if it was really important (we don't currently test whether metrics are being emitted properly). I think I'm just gonna make the fix and leave it as is for now.

robskillington (Collaborator) left a review comment:

Other than final comment, LGTM

@richardartoul richardartoul merged commit f42a475 into master Oct 16, 2018
@justinjc justinjc deleted the ra/fix-commitlog-cleanup branch October 30, 2018 20:48