Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NPE due to race with a closing series #3056

Merged
merged 10 commits into from
Mar 9, 2021
Merged

Conversation

justinjc
Copy link
Collaborator

@justinjc justinjc commented Jan 2, 2021

Fixes #2573 by adding a closed flag in the series so that we don't attempt to use a nil ID when the series gets evicted from the wired list.

This adds TestWiredListPanic that repros #2573 after running for a long duration (30 minutes or longer).

New options were introduced to increase the frequency of ticks so that the test is more likely to hit the race condition bug.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

@justinjc justinjc force-pushed the juchan/wiredlist-panic branch 2 times, most recently from 3f616e6 to a9d0f17 Compare February 18, 2021 20:57
@justinjc justinjc changed the title [WIP] Test to repro WiredList NPE Fix NPE due to race with a closing series Feb 18, 2021
@codecov
Copy link

codecov bot commented Feb 19, 2021

Codecov Report

Merging #3056 (4f45430) into master (4f45430) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3056   +/-   ##
=======================================
  Coverage    72.4%    72.4%           
=======================================
  Files        1099     1099           
  Lines      101973   101973           
=======================================
  Hits        73917    73917           
  Misses      22970    22970           
  Partials     5086     5086           
Flag Coverage Δ
aggregator 76.7% <0.0%> (ø)
cluster 84.9% <0.0%> (ø)
collector 84.3% <0.0%> (ø)
dbnode 78.9% <0.0%> (ø)
m3em 74.4% <0.0%> (ø)
m3ninx 73.5% <0.0%> (ø)
metrics 19.8% <0.0%> (ø)
msg 74.6% <0.0%> (ø)
query 67.3% <0.0%> (ø)
x 80.4% <0.0%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f45430...51e4a80. Read the comment docs.

// If the series is closed, the ID will be nil, causing the ID equality
// check to panic. Further, closing the series resets the internal
// cached blocks, so we don't need to continue with the logic below.
if s.closed || !id.Equal(s.id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need closed? I think it might be better to do nil checks, e.g.:

if id == nil || s.id == nil || !id.Equal(s.id) {
  return
}

Also, it's not clear whether we need to be worried about s itself being nil, but in that case:

if s == nil || id == nil {
  return
}

s.Lock()
defer s.Unlock()

if s.id == nil || !id.Equal(s.id) {
  return
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably ok not to check s since that's a real concerning code bug that should be caught if calling on a nil ptr, but +1 to checking if if s.closed || s.id == nil || !id.Equal(s.id) .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -421,7 +422,8 @@ func NewTestSetup(
blockRetrieverMgr := block.NewDatabaseBlockRetrieverManager(
func(md namespace.Metadata, shardSet sharding.ShardSet) (block.DatabaseBlockRetriever, error) {
retrieverOpts := fs.NewBlockRetrieverOptions().
SetBlockLeaseManager(blockLeaseManager)
SetBlockLeaseManager(blockLeaseManager).
SetCacheBlocksOnRetrieve(true)
Copy link
Collaborator

@robskillington robskillington Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, can we avoid setting this to true for all tests and just opt this in for TestWiredListPanic?

By default we set this to false for all open source builds now so it's not really indicative of the broadly deployed set of configurations is (and would rather keep the majority of the integration tests representative of that if possible).

Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than minor comments

@justinjc justinjc merged commit c0d6797 into master Mar 9, 2021
@justinjc justinjc deleted the juchan/wiredlist-panic branch March 9, 2021 01:32
soundvibe added a commit that referenced this pull request Mar 9, 2021
* master: (22 commits)
  Remove deprecated fields (#3327)
  Add quotas to Permits (#3333)
  [aggregator] Drop messages that have a drop policy applied (#3341)
  Fix NPE due to race with a closing series (#3056)
  [coordinator] Apply auto-mapping rules if-and-only-if no drop policies are in effect (#3339)
  [aggregator] Add validation in AddTimedWithStagedMetadatas (#3338)
  [coordinator] Fix panic in Ready endpoint for admin coordinator (#3335)
  [instrument] Config option to emit detailed Go runtime metrics only (#3332)
  [aggregator] Sort heap in one go, instead of iterating one-by-one (#3331)
  [pool] Add support for dynamic, sync.Pool backed, object pools (#3334)
  Enable PANIC_ON_INVARIANT_VIOLATED for tests (#3326)
  [aggregator] CanLead for unflushed window takes BufferPast into account (#3328)
  Optimize StagedMetadatas conversion (#3330)
  [m3msg] Improve message scan performance (#3319)
  [dbnode] Add reason tag to bootstrap retries metric (#3317)
  [coordinator] Enable rule filtering on prom metric type (#3325)
  Update m3dbnode-all-config.yml (#3204)
  [coordinator] Include Type in RollupOp.Equal (#3322)
  [coordinator] Simplify iteration logic of matchRollupTarget (#3321)
  [coordinator] Add rollup type to remove specific dimensions (#3318)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic due to NPE when processing WiredList
3 participants