Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose task queue diagnostics #302

Merged
merged 4 commits into from
Dec 9, 2021
Merged

Conversation

hannahhoward
Copy link
Collaborator

Goals

Provide information about the Graphsync task queues in terms of what tasks are active and pending for a given peer

Implementation

  • Expose newly implemented PeerTopics on the TaskQueue itself
  • Add converted information from this method to the PeerStats method on the main implementation to expose the current active/pending requests
  • More importantly provide a diagnostic method that examples request states and the active and pending queues and provides information about which requests are in unexpected states

@rvagg
Copy link
Member

rvagg commented Dec 8, 2021

OK, so all things being equal and stable, with this I should be able to get a PeerStats() and then call QueueDiagnostics() on either the outgoing or incoming request+queue state pair and get a len(x) == 0 result to indicate that there are no bugs and therefore dangling requests or worker queue jobs? So it's a len(x)>0 === 🚨 type diagnostic?

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine to me, 3 comments though:

  1. The reach-around for toRequestId() is unfortunate but I suppose impossible to deal with without exposing a larger surface area of those internal APIs or a major refactor. Maybe a warning that there's work to be done in the clarity of that API.
  2. We don't have any duplicate ID checking at either collection or within QueueDiagnostics() (e.g. checking whether the ID exists in the matched* maps). I know there's a bit scattered around elsewhere during actual execution, is that enough to put aside concerns about possible duplicates?
  3. PeerStats() doesn't do any synchronization so isn't it theoretically possible to catch mismatching states between the requests and the worker queue due to timing differences in execution of the collection at those points? How much does that undermine the utility of this QueueDiagnostics()?

@hannahhoward
Copy link
Collaborator Author

@rvagg I think you make some good points and I'm gonna finagle this a bit.

@hannahhoward
Copy link
Collaborator Author

@rvagg I think I've found a way to solve most of our problems. unless you object, i'm going to merge

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 will be interesting to see if this finds anything in production

@hannahhoward hannahhoward merged commit f08c2ed into main Dec 9, 2021
hannahhoward pushed a commit that referenced this pull request Dec 9, 2021
feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests (#284)

* feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

* fixup! feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

fix(responsemanager): fix flaky tests

fix(responsemanager): make fix more global

feat: add basic OT tracing for incoming requests

Closes: #271

docs(tests): document tracing test helper utilities

fix(test): increase 1s timeouts to 2s for slow CI (#289)

* fix(test): increase 1s timeouts to 2s for slow CI

* fixup! fix(test): increase 1s timeouts to 2s for slow CI

testutil/chaintypes: simplify maintenance of codegen (#294)

"go generate" now updates the generated code for us.

The separate directory for a main package was unnecessary;
a build-tag-ignored file is enough.

Using gofmt on the resulting source is now unnecessary too,
as upstream has been using go/format on its output for some time.

Finally, re-generate the output source code,
as the last time that was done we were on an older ipld-prime.

ipldutil: use chooser APIs from dagpb and basicnode (#292)

Saves us a bit of extra code, since they were added in summer.
Also avoid making defaultVisitor a variable,
which makes it clearer that it's never a nil func.

While here, replace node/basic with node/basicnode,
as the former has been deprecated in favor of the latter.

Co-authored-by: Hannah Howard <hannah@hannahhoward.net>

fix: use sync.Cond to handle no-task blocking wait (#299)

Ref: #284

Peer Stats function (#298)

* feat(graphsync): add impl method for peer stats

add method that gets current request states by request ID for a given peer

* fix(requestmanager): fix tested method

Add a bit of logging (#301)

* chore(responsemanager): add a bit of logging

* fix(responsemanager): remove code change

chore: short-circuit unnecessary message processing

Expose task queue diagnostics (#302)

* feat(impl): expose task queue diagnostics

* refactor(peerstate): put peerstate in its own module

* refactor(peerstate): make diagnostics return array
@mvdan mvdan deleted the feat/expose-peer-task-queue branch December 15, 2021 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants