This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

collect better memory stats #3612

Merged
merged 6 commits on Aug 13, 2021

Conversation

@drahnr (Contributor) commented Aug 10, 2021

Introduces new metrics that collect the total allocated and resident memory size in bytes and publish them to Prometheus. A small aid for tracking down OOM conditions by providing better data on allocations.
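
For readers unfamiliar with the mechanism, here is a minimal sketch of how such metrics can be sourced from jemalloc and exposed via Prometheus. It assumes the `tikv-jemalloc-ctl` and `prometheus` crates; the metric names and helper functions are illustrative, not the exact code added by this PR.

```rust
// Illustrative sketch only, not the code introduced by this PR.
// Assumes the `tikv-jemalloc-ctl` and `prometheus` crates; metric names are made up.
use prometheus::{IntGauge, Registry};
use tikv_jemalloc_ctl::{epoch, stats};

/// Register two gauges tracking jemalloc's view of memory usage.
fn register_memory_metrics(registry: &Registry) -> prometheus::Result<(IntGauge, IntGauge)> {
    let allocated = IntGauge::new(
        "memory_allocated_bytes",
        "Total bytes allocated by the application, as reported by jemalloc.",
    )?;
    let resident = IntGauge::new(
        "memory_resident_bytes",
        "Bytes in physically resident data pages, as reported by jemalloc.",
    )?;
    registry.register(Box::new(allocated.clone()))?;
    registry.register(Box::new(resident.clone()))?;
    Ok((allocated, resident))
}

/// Refresh jemalloc's cached statistics and push them into the gauges.
fn update_memory_metrics(allocated: &IntGauge, resident: &IntGauge) {
    // jemalloc caches its stats; advancing the epoch forces a refresh.
    if epoch::advance().is_ok() {
        if let Ok(bytes) = stats::allocated::read() {
            allocated.set(bytes as i64);
        }
        if let Ok(bytes) = stats::resident::read() {
            resident.set(bytes as i64);
        }
    }
}
```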

@drahnr drahnr self-assigned this Aug 10, 2021
@drahnr drahnr added the A3-in_progress (Pull request is in progress. No review needed at this stage.), B0-silent (Changes should not be mentioned in any release notes.), C1-low (PR touches the given topic and has a low impact on builders.) and A0-please_review (Pull request needs code review.) labels, and removed the A3-in_progress label, Aug 10, 2021
@bkchr (Member) commented Aug 10, 2021

Isn't that already provided by Prometheus? We already have memory statistics about the nodes.

@drahnr (Contributor, PR author) commented Aug 10, 2021

The idea is to provide a less coarse view of the allocations. The system view of jemalloc is too coarse for good correlations between other event occurrences and the total consumption; with this we can obtain jemalloc's internal stats synced to the time of, e.g., channel fill snapshots.

TL;DR: yes, this can be done at the OS level, but the internal stats are more useful for root cause analysis.

@drahnr drahnr changed the title collect memory stats collect better memory stats Aug 11, 2021
@bkchr (Member) commented Aug 11, 2021

> The idea is to provide a less coarse view of the allocations. The system view of jemalloc is too coarse for good correlations between other event occurrences and the total consumption; with this we can obtain jemalloc's internal stats synced to the time of, e.g., channel fill snapshots.
>
> TL;DR: yes, this can be done at the OS level, but the internal stats are more useful for root cause analysis.

How do you want to do a root cause analysis with this? You will just see that memory usage goes up, the same as the system already tells you. Or are you saying that the system's update interval is not good enough for tracking memory usage?

@drahnr (Contributor, PR author) commented Aug 11, 2021

It's good enough to detect OOM conditions, but it's not good enough to dig into the allocations.


For the Linux client we use jemalloc; as such, you will get somewhat jagged / fuzzy correlations, given that jemalloc uses multiple arenas.

This exposes only a rough representation of the true allocations to the system, which can then be reported by a system-level metrics collector. But it gives little insight into how many allocations occur and at what granularity, and only a rough estimate of the correlation between time and allocations.

Using the internal jemalloc stats avoids this. You can then subtract the exact amount of memory consumed by, e.g., the channels (channel-based memory consumption reporting TBD), work with a less noisy residual memory allocation, and correlate it in Grafana with whatever is a reasonable hypothesis. That is helpful when digging into OOM root causes.
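
A hedged sketch of the sampling idea described above, assuming a hypothetical channel memory gauge (`channel_memory_bytes`) and estimator (`estimate_channel_bytes`), since channel-based memory reporting is explicitly TBD in this thread: updating the jemalloc gauge and the channel gauge in the same tick means the series line up in time, so the "allocated minus channels" residual can be computed and correlated on the dashboard side.

```rust
// Illustrative only: `channel_memory_bytes` and `estimate_channel_bytes` are
// hypothetical, since channel-based memory consumption reporting is TBD.
use std::time::Duration;
use prometheus::IntGauge;
use tikv_jemalloc_ctl::{epoch, stats};

/// Periodically sample jemalloc's internal counter and the (hypothetical)
/// channel gauge in the same tick, so the series share a timestamp when
/// correlated on a dashboard.
fn poll_memory_stats(
    allocated: IntGauge,
    channel_memory_bytes: IntGauge,
    estimate_channel_bytes: impl Fn() -> i64,
) {
    loop {
        // Refresh jemalloc's cached stats before reading them.
        if epoch::advance().is_ok() {
            if let Ok(bytes) = stats::allocated::read() {
                allocated.set(bytes as i64);
            }
        }
        channel_memory_bytes.set(estimate_channel_bytes());

        // On the dashboard side, `memory_allocated_bytes - channel_memory_bytes`
        // yields a less noisy residual to correlate against other events.
        std::thread::sleep(Duration::from_secs(10));
    }
}
```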

@drahnr drahnr requested review from eskimor and ordian August 12, 2021 08:35
@eskimor (Member) commented Aug 12, 2021

Hi @koute! How did you track down the issue in statement distribution? Would this PR have made it easier?

@koute (Contributor) commented Aug 12, 2021

> How did you track down the issue in statement distribution?

Using my memory profiler, but using it obviously requires being able to reproduce the issue, and you also need to explicitly run Polkadot/Substrate with it hooked in.

> Would this PR have made it easier?

Hmm... from what I can see, not really. This is still way too coarse to be useful for actually figuring out where exactly the issue originates.

Simply being able to correlate memory growth with some other events we track in Prometheus is not that useful for a codebase of our size. Even if it gives us some idea of the general area the problem is in, it will usually still leave a million potential places where the problem could have originated, so you need to do a retest anyway to figure it out.

It'd be nice to have some middle-ground solution between "we only have a rough graph of memory usage" and "we hook in a profiler and know about every allocation", one that could always be turned on for every node by default, but I don't think this is it.

@eskimor (Member) commented Aug 12, 2021

Nice, thanks @koute!

@eskimor (Member) left a comment


Looks good to me. I'm also not sure it is going to help much in practice, though.

node/metrics/src/memory_stats.rs (review comment, resolved)
@drahnr (Contributor, PR author) commented Aug 12, 2021

@koute the features you desire will be added in a separate PR; jemalloc does the heavy lifting there too (statistical profiling), but that is not in scope for this PR.

node/overseer/src/metrics.rs (review comment, outdated, resolved)
node/overseer/src/metrics.rs (review comment, outdated, resolved)
Labels
A0-please_review (Pull request needs code review.), B0-silent (Changes should not be mentioned in any release notes.), C1-low (PR touches the given topic and has a low impact on builders.)
4 participants