Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when flushing discard stats. #976

Merged
merged 7 commits into from
Aug 27, 2019
Merged

Fix deadlock when flushing discard stats. #976

merged 7 commits into from
Aug 27, 2019

Conversation

poonai
Copy link
Contributor

@poonai poonai commented Aug 9, 2019

Signed-off-by: பாலாஜி ஜின்னா balaji@dgraph.io

Fixes - #970 and #976


This change is Reviewable

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ A review job has been created and sent to the PullRequest network.


Check the status or cancel PullRequest code review here.

Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sch00lb0y We don't need this patch. Look at

badger/value.go

Lines 1392 to 1397 in 9d7b751

if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold {
if err := vlog.flushDiscardStats(); err != nil {
return err
}
vlog.lfDiscardStats.updatesSinceFlush = 0
}

After flushing, we reset the update count
But over here

badger/db.go

Lines 361 to 369 in 9d7b751

func (db *DB) close() (err error) {
db.elog.Printf("Closing database")
if err := db.vlog.flushDiscardStats(); err != nil {
return errors.Wrap(err, "failed to flush discard stats")
}
atomic.StoreInt32(&db.blockWrites, 1)

We don't reset the counter. That's why compaction triggered another flush and the db crashed.

Fixing the reset counter should fix the issue. Also, try to write a test for it.

Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the ticket and sch00lb0y feedback. This appears to be the right approach.


Reviewed with ❤️ by PullRequest

பாலாஜி ஜின்னா added 2 commits August 12, 2019 03:11
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✖️ This code review was cancelled. See Details

Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)


value.go, line 1418 at r2 (raw file):

		// So ignoring it.
		// https://github.com/dgraph-io/badger/issues/970
		if err != ErrBlockedWrites {

We don't need this. The BlockedWrites was being returned because the updateSinceFlush was not being reset.
Once we've reset it, the writes will never be triggered.

Here's what was happening in the existing implementation

  1. DB was closed
  2. Discard stats were flushed (but the update counter was not reset)
  3. L0 compaction was triggered
  4. The compaction tried to update the discard stats but since the update counter was on the edge of the threshold, the compaction caused a flush.
  5. This flush failed because the write channel was closed.

The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.


value_test.go, line 1089 at r2 (raw file):

}

func TestBlockedDiscardStats(t *testing.T) {

This test can be simplified

// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err)
	defer os.Remove(dir)

	db, err := Open(getTestOptions(dir))
	require.NoError(t, err)
	// Set discard stats.
	db.vlog.lfDiscardStats = &lfDiscardStats{
		m: map[uint32]int64{0: 0},
	}
	// This is important. Set updateSinceFlush to discardStatsFlushThresold so
	// that the next update call flushes the discard stats.
	db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
	require.NoError(t, db.Close())
}

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Copy link
Contributor

@jarifibrahim jarifibrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)


value.go, line 1393 at r3 (raw file):

	if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold {
		vlog.lfDiscardStats.Unlock()
		// flushDiscardStats also aquires lock. So, we need to unlock here.

Typo: aquires => acquires


value.go, line 1429 at r3 (raw file):

// encodedDiscardStats returns []byte representation of lfDiscardStats
// This will be called while storing stats in BadgerDB
// caller should aquire lock before encoding the stats.

Typo: aquire => acquire

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)


value.go, line 1430 at r4 (raw file):

// This will be called while storing stats in BadgerDB
// caller should acquire lock before encoding the stats.
func (vlog *valueLog) encodedDiscardStats() []byte {

Possibly use safemutex. So, you can assert that we have at least a read lock.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: One comment.

Reviewed 1 of 1 files at r5.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)


value.go, line 1415 at r5 (raw file):

		Value: vlog.encodedDiscardStats(),
	}}
	req, err := vlog.db.sendToWriteCh(entries)
if err == ErrBlockedWrites { ... }
else if err != nil { ... }

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Copy link
Contributor Author

@poonai poonai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)


value.go, line 1418 at r2 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

We don't need this. The BlockedWrites was being returned because the updateSinceFlush was not being reset.
Once we've reset it, the writes will never be triggered.

Here's what was happening in the existing implementation

  1. DB was closed
  2. Discard stats were flushed (but the update counter was not reset)
  3. L0 compaction was triggered
  4. The compaction tried to update the discard stats but since the update counter was on the edge of the threshold, the compaction caused a flush.
  5. This flush failed because the write channel was closed.

The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.

Done.


value.go, line 1393 at r3 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

Typo: aquires => acquires

Done.


value.go, line 1429 at r3 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

Typo: aquire => acquire

Done.


value.go, line 1415 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…
if err == ErrBlockedWrites { ... }
else if err != nil { ... }

Done.


value_test.go, line 1089 at r2 (raw file):

Previously, jarifibrahim (Ibrahim Jarif) wrote…

This test can be simplified

// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err)
	defer os.Remove(dir)

	db, err := Open(getTestOptions(dir))
	require.NoError(t, err)
	// Set discard stats.
	db.vlog.lfDiscardStats = &lfDiscardStats{
		m: map[uint32]int64{0: 0},
	}
	// This is important. Set updateSinceFlush to discardStatsFlushThresold so
	// that the next update call flushes the discard stats.
	db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
	require.NoError(t, db.Close())
}

Done.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Let's get this merged asap.

Reviewed 1 of 1 files at r6.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)

@danielmai
Copy link
Contributor

@campoy also had a look at this PR and said it looks like the right change to make. Given the LGTMs all around, I'll merge this.

@danielmai danielmai changed the title don't return error if it writes are blocked for discard stats Fix deadlock when flushing discard stats. Aug 27, 2019
@danielmai danielmai merged commit 398445a into master Aug 27, 2019
@danielmai danielmai deleted the balaji/discard branch August 27, 2019 23:01
jarifibrahim pushed a commit that referenced this pull request Mar 12, 2020
* In updateDiscardStats the vlog.lfDiscardStats lock was acquired
  twice: once in updateDiscardStats and then a second time when
  calling flushDiscardStats. Now, we first release the lock before
  calling flushDiscardStats.

* Don't return an error if writes are blocked for discard
  stats.

* Add tests and regression tests.

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
(cherry picked from commit 398445a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants