Fix deadlock when flushing discard stats. #976

poonai · 2019-08-09T10:31:12Z

Signed-off-by: பாலாஜி ஜின்னா balaji@dgraph.io

Fixes - #970 and #976

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

pullrequest

✅ A review job has been created and sent to the PullRequest network.

Check the status or cancel PullRequest code review here.

jarifibrahim

@Sch00lb0y We don't need this patch. Look at

badger/value.go

Lines 1392 to 1397 in 9d7b751

    
	if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold {
		if err := vlog.flushDiscardStats(); err != nil {
			return err
		}
		vlog.lfDiscardStats.updatesSinceFlush = 0
	}

After flushing, we reset the update counter. But over here:

badger/db.go

Lines 361 to 369 in 9d7b751

    
	func (db *DB) close() (err error) {
		db.elog.Printf("Closing database")

		if err := db.vlog.flushDiscardStats(); err != nil {
			return errors.Wrap(err, "failed to flush discard stats")
		}

		atomic.StoreInt32(&db.blockWrites, 1)

We don't reset the counter. That's why compaction triggered another flush and the db crashed.

Fixing the reset counter should fix the issue. Also, try to write a test for it.

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)

pullrequest

I've reviewed the ticket and sch00lb0y's feedback. This appears to be the right approach.

Reviewed with ❤️ by PullRequest

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

pullrequest

✖️ This code review was cancelled. See Details

jarifibrahim

Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)

value.go, line 1418 at r2 (raw file):

		// So ignoring it.
		// https://github.com/dgraph-io/badger/issues/970
		if err != ErrBlockedWrites {

We don't need this. ErrBlockedWrites was being returned because updatesSinceFlush was not being reset.
Once we've reset it, the writes will never be triggered.

Here's what was happening in the existing implementation

1. DB was closed.
2. Discard stats were flushed (but the update counter was not reset).
3. L0 compaction was triggered.
4. The compaction tried to update the discard stats, but since the update counter was on the edge of the threshold, the compaction caused a flush.
5. This flush failed because the write channel was closed.

The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.
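
The sequence described above can be replayed with a small toy model. This is only a sketch, not badger's code: `toyLog`, `flushThreshold`, `errBlockedWrites`, and `closeThenCompact` are hypothetical stand-ins, and the `resetOnFlush` switch exists purely to compare the behaviour with and without the counter reset.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlockedWrites stands in for badger's ErrBlockedWrites in this toy model.
var errBlockedWrites = errors.New("writes are blocked")

// flushThreshold is a small, hypothetical stand-in for discardStatsFlushThreshold.
const flushThreshold = 3

// toyLog models only the pieces of the value log discussed in this thread.
type toyLog struct {
	updatesSinceFlush int
	blockWrites       bool // set once the DB starts closing
	resetOnFlush      bool // true models the behaviour after the fix
}

// flush fails once writes are blocked, like a push to a closed write channel.
func (l *toyLog) flush() error {
	if l.blockWrites {
		return errBlockedWrites
	}
	if l.resetOnFlush {
		l.updatesSinceFlush = 0
	}
	return nil
}

// update records one discard-stats update and flushes once past the threshold.
func (l *toyLog) update() error {
	l.updatesSinceFlush++
	if l.updatesSinceFlush > flushThreshold {
		if err := l.flush(); err != nil {
			return err
		}
		l.updatesSinceFlush = 0
	}
	return nil
}

// closeThenCompact replays the sequence: flush on close, block writes,
// then one more update coming from the L0 compaction.
func closeThenCompact(resetOnFlush bool) error {
	l := &toyLog{updatesSinceFlush: flushThreshold, resetOnFlush: resetOnFlush}
	if err := l.flush(); err != nil {
		return err
	}
	l.blockWrites = true
	return l.update()
}

func main() {
	fmt.Println("without reset:", closeThenCompact(false))
	fmt.Println("with reset:   ", closeThenCompact(true))
}
```

In this model the compaction-time update only reaches the blocked flush when the counter was left at the threshold by the close-time flush.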

value_test.go, line 1089 at r2 (raw file):

}

func TestBlockedDiscardStats(t *testing.T) {

This test can be simplified

// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err)
	// RemoveAll, because the directory is not empty once the DB has been opened.
	defer os.RemoveAll(dir)

	db, err := Open(getTestOptions(dir))
	require.NoError(t, err)
	// Set discard stats.
	db.vlog.lfDiscardStats = &lfDiscardStats{
		m: map[uint32]int64{0: 0},
	}
	// This is important. Set updatesSinceFlush above discardStatsFlushThreshold so
	// that the next update call flushes the discard stats.
	db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
	require.NoError(t, db.Close())
}

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

jarifibrahim

Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)

value.go, line 1393 at r3 (raw file):

	if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold {
		vlog.lfDiscardStats.Unlock()
		// flushDiscardStats also aquires lock. So, we need to unlock here.

Typo: aquires => acquires
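
For context on the quoted comment: Go's `sync.Mutex` is not reentrant, so a goroutine that calls `Lock` on a mutex it already holds blocks forever. The following is a minimal sketch of the unlock-before-flush pattern, assuming a simplified `discardStats` type and a small `flushThreshold`; both are hypothetical and only loosely mirror the names in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// flushThreshold is a hypothetical stand-in for discardStatsFlushThreshold.
const flushThreshold = 2

// discardStats is a simplified, hypothetical version of lfDiscardStats.
type discardStats struct {
	sync.Mutex
	m                 map[uint32]int64
	updatesSinceFlush int
	flushes           int
}

// flush takes the lock itself, so callers must not hold it.
func (s *discardStats) flush() {
	s.Lock()
	defer s.Unlock()
	s.flushes++
	s.updatesSinceFlush = 0
}

// update merges new stats and releases the lock before calling flush.
func (s *discardStats) update(stats map[uint32]int64) {
	s.Lock()
	for fid, n := range stats {
		s.m[fid] += n
	}
	s.updatesSinceFlush += len(stats)
	needFlush := s.updatesSinceFlush > flushThreshold
	// Release first: calling flush with the lock held would block forever.
	s.Unlock()

	if needFlush {
		s.flush()
	}
}

func main() {
	s := &discardStats{m: map[uint32]int64{}}
	s.update(map[uint32]int64{0: 10, 1: 5, 2: 1})
	fmt.Println("flushes:", s.flushes, "pending updates:", s.updatesSinceFlush)
}
```

One trade-off of this shape is that another goroutine may update the stats between `Unlock` and `flush`. That is harmless in this sketch because `flush` re-acquires the lock and works on whatever state it finds.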

value.go, line 1429 at r3 (raw file):

// encodedDiscardStats returns []byte representation of lfDiscardStats
// This will be called while storing stats in BadgerDB
// caller should aquire lock before encoding the stats.

Typo: aquire => acquire

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

manishrjain

Reviewed 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)

value.go, line 1430 at r4 (raw file):

// This will be called while storing stats in BadgerDB
// caller should acquire lock before encoding the stats.
func (vlog *valueLog) encodedDiscardStats() []byte {

Possibly use safemutex. So, you can assert that we have at least a read lock.
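
As a rough illustration of the idea, an assertable lock can be built by wrapping `sync.RWMutex` and counting holders. The `assertMutex` type below is a hypothetical sketch written for this note; it is not the safemutex implementation referred to above, and its method names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// assertMutex is a hypothetical RWMutex wrapper that counts current holders.
type assertMutex struct {
	mu      sync.RWMutex
	writers int32
	readers int32
}

func (m *assertMutex) Lock()    { m.mu.Lock(); atomic.AddInt32(&m.writers, 1) }
func (m *assertMutex) Unlock()  { atomic.AddInt32(&m.writers, -1); m.mu.Unlock() }
func (m *assertMutex) RLock()   { m.mu.RLock(); atomic.AddInt32(&m.readers, 1) }
func (m *assertMutex) RUnlock() { atomic.AddInt32(&m.readers, -1); m.mu.RUnlock() }

// heldForRead reports whether some goroutine holds a read or write lock.
// It cannot tell whether the calling goroutine is the holder.
func (m *assertMutex) heldForRead() bool {
	return atomic.LoadInt32(&m.writers) > 0 || atomic.LoadInt32(&m.readers) > 0
}

// assertRLock panics when nobody holds the lock at all.
func (m *assertMutex) assertRLock() {
	if !m.heldForRead() {
		panic("assertRLock: lock is not held")
	}
}

// stats is a hypothetical container protected by assertMutex.
type stats struct {
	assertMutex
	m map[uint32]int64
}

// encodedLen stands in for an encoder that requires the caller to hold the lock.
func (s *stats) encodedLen() int {
	s.assertRLock()
	return len(s.m)
}

func main() {
	s := &stats{m: map[uint32]int64{0: 1}}
	s.RLock()
	fmt.Println("entries:", s.encodedLen())
	s.RUnlock()
}
```

The assertion is only a debugging aid: it catches a caller that forgot to lock, but it passes whenever any goroutine holds the lock.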

manishrjain

One comment.

Reviewed 1 of 1 files at r5.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)

value.go, line 1415 at r5 (raw file):

		Value: vlog.encodedDiscardStats(),
	}}
	req, err := vlog.db.sendToWriteCh(entries)

if err == ErrBlockedWrites { ... }
else if err != nil { ... }

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

poonai

Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)

value.go, line 1418 at r2 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

We don't need this. ErrBlockedWrites was being returned because updatesSinceFlush was not being reset.
Once we've reset it, the writes will never be triggered.

Here's what was happening in the existing implementation

1. DB was closed.
2. Discard stats were flushed (but the update counter was not reset).
3. L0 compaction was triggered.
4. The compaction tried to update the discard stats, but since the update counter was on the edge of the threshold, the compaction caused a flush.
5. This flush failed because the write channel was closed.

The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.

Done.

value.go, line 1393 at r3 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

Typo: aquires => acquires

Done.

value.go, line 1429 at r3 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

Typo: aquire => acquire

Done.

value.go, line 1415 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

if err == ErrBlockedWrites { ... }
else if err != nil { ... }

Done.

value_test.go, line 1089 at r2 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

This test can be simplified

// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err)
	// RemoveAll, because the directory is not empty once the DB has been opened.
	defer os.RemoveAll(dir)

	db, err := Open(getTestOptions(dir))
	require.NoError(t, err)
	// Set discard stats.
	db.vlog.lfDiscardStats = &lfDiscardStats{
		m: map[uint32]int64{0: 0},
	}
	// This is important. Set updatesSinceFlush above discardStatsFlushThreshold so
	// that the next update call flushes the discard stats.
	db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
	require.NoError(t, db.Close())
}

Done.

manishrjain

Let's get this merged asap.

Reviewed 1 of 1 files at r6.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)

danielmai · 2019-08-27T22:47:55Z

@campoy also had a look at this PR and said it looks like the right change to make. Given the LGTMs all around, I'll merge this.

* In updateDiscardStats the vlog.lfDiscardStats lock was acquired twice: once in updateDiscardStats and then a second time when calling flushDiscardStats. Now, we first release the lock before calling flushDiscardStats.
* Don't return an error if writes are blocked for discard stats.
* Add tests and regression tests.

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
(cherry picked from commit 398445a)

don't return error if it writes are blocked for discard stats

d0ea1a1

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

poonai requested review from ashish-goswami, manishrjain and a team as code owners August 9, 2019 10:31

pullrequest bot reviewed Aug 9, 2019

poonai requested a review from jarifibrahim August 9, 2019 10:31

jarifibrahim suggested changes Aug 9, 2019

pullrequest bot reviewed Aug 10, 2019

பாலாஜி ஜின்னா added 2 commits August 12, 2019 03:11

Merge branch 'master' of github.com:dgraph-io/badger into balaji/discard

93f15e0

test added

4d4e1ab

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

pullrequest bot reviewed Aug 16, 2019

jarifibrahim suggested changes Aug 19, 2019

regression test added

20e9315

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

jarifibrahim approved these changes Aug 20, 2019

jarifibrahim mentioned this pull request Aug 20, 2019

dead lock in updateDiscardStats on master #993

Closed

fixing typo

6c4d55c

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

manishrjain approved these changes Aug 20, 2019

jarifibrahim mentioned this pull request Aug 21, 2019

Too many goroutines blocked dgraph-io/dgraph#3835

Closed

Merge branch 'master' of github.com:dgraph-io/badger into balaji/discard

3cdb2f0

manishrjain suggested changes Aug 26, 2019

fix manish comment

90288dd

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>

poonai commented Aug 27, 2019

manishrjain approved these changes Aug 27, 2019

danielmai changed the title ~~don't return error if it writes are blocked for discard stats~~ Fix deadlock when flushing discard stats. Aug 27, 2019

danielmai merged commit 398445a into master Aug 27, 2019

danielmai deleted the balaji/discard branch August 27, 2019 23:01

ashish-goswami mentioned this pull request Aug 28, 2019

Changing the schema after live loader some times results in a deadlock (single node alpha) dgraph-io/dgraph#3875

Closed

poonai mentioned this pull request Sep 10, 2019

deadlock when writing into Badger using multiple routines #1032

Closed

jarifibrahim mentioned this pull request Sep 18, 2019

CompactL0OnClose fails on closing db #970

Closed
