
fix: use sync.Cond to handle no-task blocking wait #299

Merged: 1 commit into main, Dec 4, 2021

Conversation

@rvagg (Member) commented on Dec 3, 2021:

Based on feedback from #284, I think this is closer to what I want. While we currently don't have a use for more than one consumer, I want WaitForNoActiveTasks() to be a general blocking call that can potentially (or at least safely) be used by multiple consumers, and that doesn't return until there are zero active tasks.
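
To make the intent concrete, here is a minimal sketch of the waiter side of that pattern. It is not the exact diff from this PR; the type name comes from #284's WorkerTaskQueue and the field names (noTaskCond, activeTasks) from the hunks quoted in the review below.

    package taskqueue

    import "sync"

    // Sketch only: a condition variable guards an activeTasks counter, and
    // WaitForNoActiveTasks blocks until that counter drops to zero.
    type WorkerTaskQueue struct {
        noTaskCond  *sync.Cond // e.g. sync.NewCond(&sync.Mutex{})
        activeTasks int64
    }

    // WaitForNoActiveTasks does not return until there are zero active tasks.
    // Re-checking the condition in a loop keeps it safe for multiple waiters.
    func (tq *WorkerTaskQueue) WaitForNoActiveTasks() {
        tq.noTaskCond.L.Lock()
        for tq.activeTasks > 0 {
            tq.noTaskCond.Wait() // releases the lock while blocked, re-acquires on wake
        }
        tq.noTaskCond.L.Unlock()
    }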

@rvagg requested a review from mvdan on December 3, 2021.

On this diff hunk:

    case <-tq.noTaskSignal:
    }
    tq.noTaskCond.L.Lock()
    for tq.activeTasks > 0 {

rvagg (Member, Author) commented:

I know this looks unnecessary, but my assumption here is that there are no guarantees about who gets the mutex next when multiple parties are waiting on it, both via Wait() and via Lock(). So if a second task started and took the lock before our waiting caller was woken, we could find ourselves back in a tq.activeTasks > 0 state even after waking up here.

At least that's how it would work in C++ and Java. Or do we have stronger guarantees in Go about who gets the lock next after a Broadcast()?

Contributor:

I think the loop is right - the sync.Cond.Wait docs use a loop as an example too.

@mvdan (Contributor) left a comment:

sync.Cond is actually what I was thinking about suggesting if you wanted to make this more flexible :)

Note that there's also sync.WaitGroup, but it's designed for a static pattern where you Add tasks upfront and Wait afterwards. It may panic if you call Add after Wait, which we may do here. It's a very close use case, but not applicable.
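
For contrast, a minimal illustration (not code from this repo) of the static pattern sync.WaitGroup is designed for; the assumption here is a fixed batch of tasks, which is exactly what a long-lived task queue does not have.

    package main

    import "sync"

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1) // every Add happens up front, before Wait
            go func() {
                defer wg.Done()
                // ... run one task ...
            }()
        }
        wg.Wait() // returns once the four goroutines above have called Done
        // A worker task queue would keep Adding new tasks after this point,
        // which is the WaitGroup re-use pattern that can panic; hence sync.Cond.
    }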


On this diff hunk:

    }
    tq.noTaskCond.L.Unlock()

Contributor:

sync.Cond.Broadcast doesn't require holding the lock to be called, so I wonder if you could make this loop lock-free by also continuing to use sync/atomic for activeTasks.

I think that's ideally what we want, because otherwise we're adding some amount of lock contention on each worker. The locks are only held for tiny amounts of time here, but they still add their overhead, whereas atomics are comparatively very cheap.

In other words, I think we can just use sync.Cond for its broadcast feature, and continue using atomics for the counter. If our condition was more complex and we couldn't use atomics, then we'd need the shared lock for sure, but I don't think we absolutely need it here.
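
As a rough sketch of that suggestion (the method name taskDone is assumed, not taken from the diff, and it would also need "sync/atomic" on top of the sketch above), the worker side would look something like this; the rest of the thread shows why it can lose a wake-up.

    // Extends the WorkerTaskQueue sketch above, keeping activeTasks atomic.
    func (tq *WorkerTaskQueue) taskDone() {
        if atomic.AddInt64(&tq.activeTasks, -1) == 0 {
            // No lock held here: a waiter that has checked the counter but has
            // not yet called Wait() can miss this broadcast (see the race below).
            tq.noTaskCond.Broadcast()
        }
    }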

Another commenter replied:

Careful: if a sync.Cond.Broadcast() is issued and there isn't anyone Wait()ing to "hear" it, the broadcast is discarded. You effectively need at least an implicit "lock" (by some other means) to ensure that something is already in Wait().

Contributor:

I guess this depends on when the Wait call happens. I assume we call Wait before the workers all go idle. If we may call Wait at a later time, then indeed we need something more. Perhaps Wait should first atomically check if we're already idle, and if so, return immediately. Otherwise, block until the next broadcast.

The same contributor, following up:

Wait, the current code already does this - it only Waits inside the loop after it checks that we're not already idle.

I think you're saying that replacing the lock with atomics on the broadcast side could add a race, like the following sequence of events:

  1. WaitForNoActiveTasks grabs the lock
  2. WaitForNoActiveTasks sees the counter is 1, so it enters the loop
  3. worker finishes a task, and broadcasts without grabbing the lock
  4. WaitForNoActiveTasks calls Cond.Wait, which doesn't see the broadcast as it is late

Whereas with Rod's current code:

  1. WaitForNoActiveTasks grabs the lock
  2. WaitForNoActiveTasks sees the counter is 1, so it enters the loop
  3. worker finishes a task, and tries to grab the lock to broadcast - blocking, as WaitForNoActiveTasks has the lock
  4. WaitForNoActiveTasks calls Cond.Wait, releasing the block temporarily
  5. worker grabs the lock and broadcasts
  6. WaitForNoActiveTasks sees the broadcast and finishes

So I think you're right: we don't need to hold the lock to broadcast, as the docs say, but not doing so introduces a form of logic race.
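
So, as a sketch, the lock-holding worker side the thread converges on looks like this (same hypothetical taskDone name as above): because the decrement and the Broadcast happen under the lock that WaitForNoActiveTasks holds while checking the counter, a waiter is either already parked in Wait(), which released the lock atomically, or has not checked yet, so the wake-up cannot slip through the gap.

    func (tq *WorkerTaskQueue) taskDone() {
        tq.noTaskCond.L.Lock()
        tq.activeTasks--
        if tq.activeTasks == 0 {
            // The lock serializes this with the waiter's check-then-Wait,
            // so the broadcast cannot fall between the check and the Wait.
            tq.noTaskCond.Broadcast()
        }
        tq.noTaskCond.L.Unlock()
    }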

The earlier commenter replied:

Yup! This is what I was trying to convey, sorry for being terse!

@mvdan (Contributor) left a comment:

SGTM - as discussed in the thread, I don't think we can make this lock-free on the worker side.

@hannahhoward (Collaborator) left a comment:

LGTM

@hannahhoward merged commit 83aebf1 into main on Dec 4, 2021.

@hannahhoward pushed a commit that referenced this pull request on Dec 9, 2021:

feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests (#284)

* feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

* fixup! feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

fix(responsemanager): fix flaky tests

fix(responsemanager): make fix more global

feat: add basic OT tracing for incoming requests

Closes: #271

docs(tests): document tracing test helper utilities

fix(test): increase 1s timeouts to 2s for slow CI (#289)

* fix(test): increase 1s timeouts to 2s for slow CI

* fixup! fix(test): increase 1s timeouts to 2s for slow CI

testutil/chaintypes: simplify maintenance of codegen (#294)

"go generate" now updates the generated code for us.

The separate directory for a main package was unnecessary;
a build-tag-ignored file is enough.

Using gofmt on the resulting source is now unnecessary too,
as upstream has been using go/format on its output for some time.

Finally, re-generate the output source code,
as the last time that was done we were on an older ipld-prime.

ipldutil: use chooser APIs from dagpb and basicnode (#292)

Saves us a bit of extra code, since they were added in summer.
Also avoid making defaultVisitor a variable,
which makes it clearer that it's never a nil func.

While here, replace node/basic with node/basicnode,
as the former has been deprecated in favor of the latter.

Co-authored-by: Hannah Howard <hannah@hannahhoward.net>

fix: use sync.Cond to handle no-task blocking wait (#299)

Ref: #284

Peer Stats function (#298)

* feat(graphsync): add impl method for peer stats

add method that gets current request states by request ID for a given peer

* fix(requestmanager): fix tested method

Add a bit of logging (#301)

* chore(responsemanager): add a bit of logging

* fix(responsemanager): remove code change

chore: short-circuit unnecessary message processing

Expose task queue diagnostics (#302)

* feat(impl): expose task queue diagnostics

* refactor(peerstate): put peerstate in its own module

* refactor(peerstate): make diagnostics return array

@rvagg deleted the rvagg/taskqueue-sync branch on December 15, 2021.