Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stats: add symbol-table contention logging and admin endpoint #8035

Closed
jmarantz opened this issue Aug 25, 2019 · 7 comments · Fixed by #8116
Closed

stats: add symbol-table contention logging and admin endpoint #8035

jmarantz opened this issue Aug 25, 2019 · 7 comments · Fixed by #8116
Assignees
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Milestone

Comments

@jmarantz
Copy link
Contributor

Per discussion in #7842 it would be useful to be able to see symbol-table contention by name in logs and an admin endpoint.

@jmarantz jmarantz self-assigned this Aug 25, 2019
@jmarantz
Copy link
Contributor Author

I think this should block #4980 .

@mattklein123 mattklein123 added enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue labels Aug 26, 2019
@mattklein123 mattklein123 added this to the 1.12.0 milestone Aug 26, 2019
@jmarantz
Copy link
Contributor Author

jmarantz commented Aug 26, 2019

I had a variation on my this idea above which is to check our contention atomic (common/common/mutex_tracer_impl.h) and log a warning if there is actual contention on a particular symbol. This has the advantage of being relatively simple to implement and not at all noisy. The disadvantage is that it will not flag usage of the global lock at all; only if it actually contends in practice, so it will not flag potential problems in advance that may arise as scale (# threads, traffic) increases.

The other options for logging may be moderately complex to implement (a small simplified lru-cache of looked-up names) or somewhat noisy (trace-log all lookups). The simplest option might be just a deque containing the most recent (say) 10 lookups, which may be all of the same name. For the simple deque or the lru-cache, the operational implication is that we'd have to check the logs or the admin endpoint on running systems and note the names that were missed.

@mattklein123
Copy link
Member

mattklein123 commented Aug 27, 2019

I had a variation on my this idea above which is to check our contention atomic (common/common/mutex_tracer_impl.h) and log a warning if there is actual contention on a particular symbol. This has the advantage of being relatively simple to implement and not at all noisy. The disadvantage is that it will not flag usage of the global lock at all; only if it actually contends in practice, so it will not flag potential problems in advance that may arise as scale (# threads, traffic) increases.

As an aside, I was looking at the /contention endpoint recently, and IMO it's only marginally useful as you can't tell what mutex is contended. I haven't looked in detail, but it looks to me like the mutex tracer might support names or tagging, and I think we should be storing that for output in the contention data. I think that would be required for this also?

@jmarantz
Copy link
Contributor Author

You mean via the "opaque name for contented mutex" here?
https://github.com/abseil/abseil-cpp/blob/a0d1e098c2f99694fa399b175a7ccf920762030e/absl/synchronization/mutex.h#L972

Have you figured out how to map that back to the absl::Mutex*? If so we could optionally name our mutexes with some maps. I was thinking to try to capture back-traces at contention. Maybe the opaque handle could be used to see whether there is enough contention to make recording backtraces worthwhile?

FWIW I found the contention metric useful in the process of symbolizing the stats access for http codes, because I could run a load-test with siege, with and without a proposed PR, to determine how that PR affected contention.

In any case I'm leaning away from relying on dynamic contention for this issue, and toward just keeping bounded queues of dynamically accessed StatNames and Symbols, allowing repeats and dropping old ones, for simplicity of implementation and speed. We can then collate duplicates in the admin handler and in log statements. I expect this queue to be rapidly turn over during startup, but should be quiescent during operation (except perhaps in an xDS update). So in operation if we just check the queue it should ideally be filled with old requests from startup.

@mattklein123
Copy link
Member

I didn't look into the mutex tracer code at all, but yeah, was hoping that we could print out a per-mutex map, at least for certain named mutexes, but I'm not sure what is possible.

@jmarantz
Copy link
Contributor Author

It's possible but involves keeping a bunch of maps up-to-date including when mutexes are destructed, I'm leaning against taking this on for two reasons:

  1. The important thing for this issue is the string being dynamically symbolized; the mutex will likely be the main symbol-table mutex, or one of the StatNameSet mutexes, and to name those you'd have to pass in the name to the StatNameSet constructor. Though we may find it necessary to do that anyway.

  2. I think it will be more useful and actionable to get deterministic information about potential contention rather than dynamic information about actual contention.

So if it seems reasonable to you I will pursue the deque approach.

@mattklein123
Copy link
Member

Sure sounds good (I still think it would be useful to think about how to make the contention output more useful, but that's a different issue I agree).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants