stats: add symbol-table contention logging and admin endpoint #8035

jmarantz · 2019-08-25T07:27:03Z

Per discussion in #7842 it would be useful to be able to see symbol-table contention by name in logs and an admin endpoint.

jmarantz · 2019-08-25T07:27:42Z

I think this should block #4980 .

jmarantz · 2019-08-26T23:57:12Z

I had a variation on my this idea above which is to check our contention atomic (common/common/mutex_tracer_impl.h) and log a warning if there is actual contention on a particular symbol. This has the advantage of being relatively simple to implement and not at all noisy. The disadvantage is that it will not flag usage of the global lock at all; only if it actually contends in practice, so it will not flag potential problems in advance that may arise as scale (# threads, traffic) increases.

The other options for logging may be moderately complex to implement (a small simplified lru-cache of looked-up names) or somewhat noisy (trace-log all lookups). The simplest option might be just a deque containing the most recent (say) 10 lookups, which may be all of the same name. For the simple deque or the lru-cache, the operational implication is that we'd have to check the logs or the admin endpoint on running systems and note the names that were missed.

mattklein123 · 2019-08-27T17:45:44Z

I had a variation on my this idea above which is to check our contention atomic (common/common/mutex_tracer_impl.h) and log a warning if there is actual contention on a particular symbol. This has the advantage of being relatively simple to implement and not at all noisy. The disadvantage is that it will not flag usage of the global lock at all; only if it actually contends in practice, so it will not flag potential problems in advance that may arise as scale (# threads, traffic) increases.

As an aside, I was looking at the /contention endpoint recently, and IMO it's only marginally useful as you can't tell what mutex is contended. I haven't looked in detail, but it looks to me like the mutex tracer might support names or tagging, and I think we should be storing that for output in the contention data. I think that would be required for this also?

jmarantz · 2019-08-29T02:15:25Z

You mean via the "opaque name for contented mutex" here?
https://github.com/abseil/abseil-cpp/blob/a0d1e098c2f99694fa399b175a7ccf920762030e/absl/synchronization/mutex.h#L972

Have you figured out how to map that back to the absl::Mutex*? If so we could optionally name our mutexes with some maps. I was thinking to try to capture back-traces at contention. Maybe the opaque handle could be used to see whether there is enough contention to make recording backtraces worthwhile?

FWIW I found the contention metric useful in the process of symbolizing the stats access for http codes, because I could run a load-test with siege, with and without a proposed PR, to determine how that PR affected contention.

In any case I'm leaning away from relying on dynamic contention for this issue, and toward just keeping bounded queues of dynamically accessed StatNames and Symbols, allowing repeats and dropping old ones, for simplicity of implementation and speed. We can then collate duplicates in the admin handler and in log statements. I expect this queue to be rapidly turn over during startup, but should be quiescent during operation (except perhaps in an xDS update). So in operation if we just check the queue it should ideally be filled with old requests from startup.

mattklein123 · 2019-08-29T19:25:47Z

I didn't look into the mutex tracer code at all, but yeah, was hoping that we could print out a per-mutex map, at least for certain named mutexes, but I'm not sure what is possible.

jmarantz · 2019-08-30T13:50:03Z

It's possible but involves keeping a bunch of maps up-to-date including when mutexes are destructed, I'm leaning against taking this on for two reasons:

The important thing for this issue is the string being dynamically symbolized; the mutex will likely be the main symbol-table mutex, or one of the StatNameSet mutexes, and to name those you'd have to pass in the name to the StatNameSet constructor. Though we may find it necessary to do that anyway.
I think it will be more useful and actionable to get deterministic information about potential contention rather than dynamic information about actual contention.

So if it seems reasonable to you I will pursue the deque approach.

mattklein123 · 2019-08-30T16:03:04Z

Sure sounds good (I still think it would be useful to think about how to make the contention output more useful, but that's a different issue I agree).

jmarantz self-assigned this Aug 25, 2019

jmarantz mentioned this issue Aug 25, 2019

stats: Clean up all calls to Scope::counter() et al in production code. #7842

Merged

mattklein123 added enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue labels Aug 26, 2019

mattklein123 added this to the 1.12.0 milestone Aug 26, 2019

jmarantz mentioned this issue Sep 2, 2019

stats: Remember recent lookups and display them in an admin endpoint #8116

Merged

jmarantz closed this as completed in #8116 Oct 17, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stats: add symbol-table contention logging and admin endpoint #8035

stats: add symbol-table contention logging and admin endpoint #8035

jmarantz commented Aug 25, 2019

jmarantz commented Aug 25, 2019

jmarantz commented Aug 26, 2019 •

edited

Loading

mattklein123 commented Aug 27, 2019 •

edited

Loading

jmarantz commented Aug 29, 2019

mattklein123 commented Aug 29, 2019

jmarantz commented Aug 30, 2019

mattklein123 commented Aug 30, 2019

stats: add symbol-table contention logging and admin endpoint #8035

stats: add symbol-table contention logging and admin endpoint #8035

Comments

jmarantz commented Aug 25, 2019

jmarantz commented Aug 25, 2019

jmarantz commented Aug 26, 2019 • edited Loading

mattklein123 commented Aug 27, 2019 • edited Loading

jmarantz commented Aug 29, 2019

mattklein123 commented Aug 29, 2019

jmarantz commented Aug 30, 2019

mattklein123 commented Aug 30, 2019

jmarantz commented Aug 26, 2019 •

edited

Loading

mattklein123 commented Aug 27, 2019 •

edited

Loading