-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock when flushing discard stats. #976
Conversation
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ A review job has been created and sent to the PullRequest network.
Check the status or cancel PullRequest code review here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sch00lb0y We don't need this patch. Look at
Lines 1392 to 1397 in 9d7b751
if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold { | |
if err := vlog.flushDiscardStats(); err != nil { | |
return err | |
} | |
vlog.lfDiscardStats.updatesSinceFlush = 0 | |
} |
After flushing, we reset the update count
But over here
Lines 361 to 369 in 9d7b751
func (db *DB) close() (err error) { | |
db.elog.Printf("Closing database") | |
if err := db.vlog.flushDiscardStats(); err != nil { | |
return errors.Wrap(err, "failed to flush discard stats") | |
} | |
atomic.StoreInt32(&db.blockWrites, 1) | |
We don't reset the counter. That's why compaction triggered another flush and the db crashed.
Fixing the reset counter should fix the issue. Also, try to write a test for it.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the ticket and sch00lb0y feedback. This appears to be the right approach.
Reviewed with ❤️ by PullRequest
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✖️ This code review was cancelled. See Details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)
value.go, line 1418 at r2 (raw file):
// So ignoring it. // https://github.com/dgraph-io/badger/issues/970 if err != ErrBlockedWrites {
We don't need this. The BlockedWrites was being returned because the updateSinceFlush
was not being reset.
Once we've reset it, the writes will never be triggered.
Here's what was happening in the existing implementation
- DB was closed
- Discard stats were flushed (but the update counter was not reset)
- L0 compaction was triggered
- The compaction tried to update the discard stats but since the update counter was on the edge of the threshold, the compaction caused a flush.
- This flush failed because the write channel was closed.
The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.
value_test.go, line 1089 at r2 (raw file):
} func TestBlockedDiscardStats(t *testing.T) {
This test can be simplified
// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
dir, err := ioutil.TempDir("", "badger-test")
require.NoError(t, err)
defer os.Remove(dir)
db, err := Open(getTestOptions(dir))
require.NoError(t, err)
// Set discard stats.
db.vlog.lfDiscardStats = &lfDiscardStats{
m: map[uint32]int64{0: 0},
}
// This is important. Set updateSinceFlush to discardStatsFlushThresold so
// that the next update call flushes the discard stats.
db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
require.NoError(t, db.Close())
}
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)
value.go, line 1393 at r3 (raw file):
if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold { vlog.lfDiscardStats.Unlock() // flushDiscardStats also aquires lock. So, we need to unlock here.
Typo: aquires => acquires
value.go, line 1429 at r3 (raw file):
// encodedDiscardStats returns []byte representation of lfDiscardStats // This will be called while storing stats in BadgerDB // caller should aquire lock before encoding the stats.
Typo: aquire => acquire
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)
value.go, line 1430 at r4 (raw file):
// This will be called while storing stats in BadgerDB // caller should acquire lock before encoding the stats. func (vlog *valueLog) encodedDiscardStats() []byte {
Possibly use safemutex. So, you can assert that we have at least a read lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r5.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)
value.go, line 1415 at r5 (raw file):
Value: vlog.encodedDiscardStats(), }} req, err := vlog.db.sendToWriteCh(entries)
if err == ErrBlockedWrites { ... }
else if err != nil { ... }
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
value.go, line 1418 at r2 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
We don't need this. The BlockedWrites was being returned because the
updateSinceFlush
was not being reset.
Once we've reset it, the writes will never be triggered.Here's what was happening in the existing implementation
- DB was closed
- Discard stats were flushed (but the update counter was not reset)
- L0 compaction was triggered
- The compaction tried to update the discard stats but since the update counter was on the edge of the threshold, the compaction caused a flush.
- This flush failed because the write channel was closed.
The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.
Done.
value.go, line 1393 at r3 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo: aquires => acquires
Done.
value.go, line 1429 at r3 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo: aquire => acquire
Done.
value.go, line 1415 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if err == ErrBlockedWrites { ... } else if err != nil { ... }
Done.
value_test.go, line 1089 at r2 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
This test can be simplified
// Regression test for https://github.com/dgraph-io/badger/issues/970 func TestBlockedDiscardStats(t *testing.T) { dir, err := ioutil.TempDir("", "badger-test") require.NoError(t, err) defer os.Remove(dir) db, err := Open(getTestOptions(dir)) require.NoError(t, err) // Set discard stats. db.vlog.lfDiscardStats = &lfDiscardStats{ m: map[uint32]int64{0: 0}, } // This is important. Set updateSinceFlush to discardStatsFlushThresold so // that the next update call flushes the discard stats. db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1 require.NoError(t, db.Close()) }
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r6.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
@campoy also had a look at this PR and said it looks like the right change to make. Given the LGTMs all around, I'll merge this. |
* In updateDiscardStats the vlog.lfDiscardStats lock was acquired twice: once in updateDiscardStats and then a second time when calling flushDiscardStats. Now, we first release the lock before calling flushDiscardStats. * Don't return an error if writes are blocked for discard stats. * Add tests and regression tests. Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io> (cherry picked from commit 398445a)
Signed-off-by: பாலாஜி ஜின்னா balaji@dgraph.io
Fixes - #970 and #976
This change is