
[Breaking][core.metrics] Remove process field from ops metrics & /stats API #104124

Closed
lukeelmers opened this issue Jul 1, 2021 · 20 comments
Labels: Breaking Change, impact:low, loe:medium, Team:Core, won't fix

Comments

@lukeelmers
Member

Part of #68626
Blocked by #104031

Summary

Once #104031 is complete and the existing process field has been deprecated from our ops metrics & /stats APIs, we'll need to remove it. This is necessary to support a multi-process Kibana as outlined in the clustering RFC: #94057

Alternative

As outlined in the RFC, there is a way that we could keep the existing field in place and avoid breaking it in 8.0, but it has significant downsides.

The alternative would essentially be aggregating each of the metrics based on data reported from each worker:

{
  // ...
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824, // sum of coordinator + workers
        "used_bytes": 296297424, // sum of coordinator + workers
        "size_limit": 4345298944 // sum of coordinator + workers
      },
      "resident_set_size_bytes": 563625984 // sum of coordinator + workers
    },
    "pid": 52646, // pid of the coordinator
    "event_loop_delay": 0.22967800498008728, // max of coordinator + workers
    "uptime_ms": 1706021.930404 // uptime of the coordinator
  },
  // ...
}
  • size_limit in particular could be confusing
  • summing available/used heap & node RSS is straightforward
  • taking the max of event_loop_delay mostly makes sense, as we are generally only interested in that number when it is high anyway
  • pid and uptime_in_millis from the coordinator make sense, especially as long as we are killing all workers any time one of them dies. However, if in the future we respawn workers that die, this could be misleading.

Overall, this approach means the data are less accurate and potentially misleading, and therefore less helpful for users. That's why making the breaking change in 8.0 is preferred.
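
For illustration only, a minimal sketch of what that aggregation could look like (the OpsProcessMetrics shape and aggregateProcessMetrics helper below are hypothetical names, not part of any actual API):

    interface OpsProcessMetrics {
      pid: number;
      memory: {
        heap: { total_bytes: number; used_bytes: number; size_limit: number };
        resident_set_size_bytes: number;
      };
      event_loop_delay: number;
      uptime_ms: number;
    }

    // Collapse coordinator + worker metrics into a single `process` block, per the rules above.
    function aggregateProcessMetrics(
      coordinator: OpsProcessMetrics,
      workers: OpsProcessMetrics[]
    ): OpsProcessMetrics {
      const all = [coordinator, ...workers];
      const sum = (select: (p: OpsProcessMetrics) => number) =>
        all.reduce((acc, p) => acc + select(p), 0);

      return {
        pid: coordinator.pid, // pid of the coordinator
        memory: {
          heap: {
            total_bytes: sum((p) => p.memory.heap.total_bytes),
            used_bytes: sum((p) => p.memory.heap.used_bytes),
            size_limit: sum((p) => p.memory.heap.size_limit),
          },
          resident_set_size_bytes: sum((p) => p.memory.resident_set_size_bytes),
        },
        event_loop_delay: Math.max(...all.map((p) => p.event_loop_delay)), // max across processes
        uptime_ms: coordinator.uptime_ms, // uptime of the coordinator
      };
    }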

@lukeelmers lukeelmers added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Breaking Change required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. labels Jul 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh
Member

Bamieh commented Jul 6, 2021

The current event_loop_delay is not really useful since the delay is captured only at collection time rather than continuously over the life of the process.

Our new event loop delay collector captures this metric more accurately in a histogram throughout the duration of the process. We might want to rely on that approach or move it around if needed. Related: #98673
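
For reference, Node's built-in perf_hooks.monitorEventLoopDelay() is the kind of histogram-based collector described here; a minimal sketch of the idea (illustrative only, not the actual collector from #98673):

    import { monitorEventLoopDelay } from 'perf_hooks';

    // Continuously sample event loop delay into a histogram (values are in nanoseconds).
    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    // At collection time, report summary statistics instead of a single point-in-time value.
    setInterval(() => {
      console.log({
        mean_ms: histogram.mean / 1e6,
        max_ms: histogram.max / 1e6,
        p99_ms: histogram.percentile(99) / 1e6,
      });
      histogram.reset();
    }, 30_000);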

@lukeelmers
Member Author

Our new event loop delay collector captures this metric more accurately in a histogram throughout the duration of the process. We might want to rely on that approach or move it around if needed

👍 Added a note on this to the issue where we are discussing the new format (#104031)

@afharo
Member

afharo commented Aug 16, 2021

We might need to sync with Metricbeat and Monitoring when applying these changes.

@afharo
Member

afharo commented Aug 24, 2021

I created this issue for the Metricbeat team to track them: elastic/beats#27569

@afharo
Member

afharo commented Aug 24, 2021

Also adding that this may be a breaking change to GET /api/status since it also exposes the ops metrics.

@matschaffer
Contributor

It looks like

field: 'process.memory.resident_set_size_in_bytes',
references the process object to retrieve RSS information.

Do we have an alternate field to get information about how much memory kibana is using?

@pgayvallet
Contributor

Do we have an alternate field to get information about how much memory kibana is using?

The alternative would essentially be aggregating each of the metrics based on data reported from each worker

As said in the issue's description, aggregating some of these values doesn't really make sense. However, for monitoring purposes, maybe we should introduce aggregated values for a subset of the memory information? The RSS is the most obvious example, as we can imagine monitoring processes wanting to know the total amount of memory Kibana is consuming. As this feature is a breaking change anyway, we should probably use a new section for that, such as aggregatedData, or something?
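
For illustration, a rough sketch of what such a section could look like (the aggregated name and field set below are purely hypothetical):

    // Purely hypothetical: per-process entries plus a small aggregated section,
    // limited to the values that are meaningful to sum across processes.
    interface AggregatedOpsMetrics {
      resident_set_size_bytes: number; // coordinator + all workers
      heap_used_bytes: number; // coordinator + all workers
    }

    interface OpsMetricsResponse {
      processes: Array<{ pid: number; memory: { resident_set_size_bytes: number } }>; // per-process data (#104031)
      aggregated: AggregatedOpsMetrics;
    }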

@matschaffer
Contributor

I think in the end it'll be important for kibana developers and administrators to know how the app is using memory.

Since this is probably common for any node.js process, I wonder what @elastic/apm-agent-node-js team has explored in that space.

Having a common approach between both APM & metricbeat could be beneficial down the road.

@trentm
Member

trentm commented Sep 7, 2021

I wonder what @elastic/apm-agent-node-js team has explored in that space.

Nothing much at all, unfortunately.

The Node.js APM agent does no instrumentation specific to the cluster module or aggregation of metrics across separate cluster processes. Assuming the APM agent is started outside of cluster.is{Master|Primary|Worker} logic, there will be an APM agent in each process operating independently. Every metricsInterval (default 30s) the APM agent will report metrics like the following for each process:

    {
        "metadata": {
...
            "process": {
                "pid": 5606,
                "ppid": 5605,
...
    }
    {
        "metricset": {
            "samples": {
                "system.cpu.total.norm.pct": { "value": 0.022109233365979697 },
                "system.memory.total": { "value": 68719476736 },
                "system.memory.actual.free": { "value": 42007408640 },
                "system.process.cpu.total.norm.pct": { "value": 0.0005280573627318066 },
                "system.process.cpu.system.norm.pct": { "value": 0.0001569328440209994 },
                "system.process.cpu.user.norm.pct": { "value": 0.0003711245187108072 },
                "system.process.memory.rss.bytes": { "value": 36159488 },
                "nodejs.handles.active": { "value": 4 },
                "nodejs.requests.active": { "value": 0 },
                "nodejs.eventloop.delay.avg.ms": { "value": 1.6510719999999992 },
                "nodejs.memory.heap.allocated.bytes": { "value": 11169792 },
                "nodejs.memory.heap.used.bytes": { "value": 8776264 },
                "nodejs.memory.external.bytes": { "value": 1515422 },
                "nodejs.memory.arrayBuffers.bytes": { "value": 17655 }
            },
            "timestamp": 1631033459971000,
            "tags": {   // aka "labels" once imported to Elasticsearch
                "hostname": "pink.local",
                "env": "development"
            }
        }
    }

It might be useful to add a label (via globalLabels APM config) to metrics for whether the process is a primary (coordinator) or a worker.
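
For example, something along these lines (a sketch only; processRole is a made-up label name, and Node 16+ exposes cluster.isPrimary while older versions use cluster.isMaster):

    import cluster from 'cluster';
    import apm from 'elastic-apm-node';

    // Start the agent in every process and tag its metrics with the process's role,
    // so coordinator and worker metrics can be told apart in the APM UI.
    // (In practice the agent should be started before other application imports.)
    apm.start({
      serviceName: 'kibana',
      globalLabels: {
        processRole: cluster.isPrimary ? 'coordinator' : 'worker',
      },
    });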

@matschaffer
Contributor

there will be an APM agent in each process operating independently.

Does the APM UI then represent each cluster process as its own host? And if so, how do they get named? I'd be curious to see it in the UI if possible.

On the kibana side, do the workers have any sort of designation, like maybe web, worker, etc.? Or do we just get N generic kibana processes any time someone starts kibana?

Is there a way we can run kibana in cluster mode today to see what it looks like in the OS process table maybe?

Trying to wrap my head around how we can show a kibana admin real data about memory consumption without it being confusing to look at.

@pgayvallet
Contributor

On the kibana side do the workers have any sort of designation like maybe: web, worker, etc?

We're not yet planning on having specialized workers. We'll have one coordinator and N workers, with each worker having a unique identifier. E.g. in a 2-worker instance, we'll have coordinator, worker-1 and worker-2.

@matschaffer
Contributor

And (stepping back from APIs) how would an operator know if the memory usage of the coordinator and workers is reasonable?

Even just a bash example would be fine at this stage I think.

@pgayvallet
Contributor

And (stepping back from APIs) how would an operator know if the memory usage of the coordinator and workers is reasonable?

As an operator, I'm not sure the memory usage of specific processes is really the first thing to monitor, compared to the aggregated RSS from all processes. Which is why I was asking whether it may make sense to expose some of the aggregated values (in #104124 (comment)).

@matschaffer
Contributor

aggregated RSS from all processes?

For sure. I could see a case for both. On one side you want to make sure the whole package isn't using more RAM than you're willing to give it. On the other side we probably want to be able to get performance information for each worker (especially kibana devs trying to help customers root out performance issues).

@matschaffer
Contributor

@pgayvallet any news on what the new data from kibana is expected to look like?

It'd be great to have a way to exercise the new APIs so we can start working on metricbeat/UI changes required.

@matschaffer
Contributor

One thought: even if the aggregated API doesn't seem feasible, if we have a way for metricbeat to query each subprocess and send that to a monitoring cluster, we could aggregate at query time in the UI.
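
For instance, a rough sketch of query-time aggregation with the Elasticsearch JS client (v8 style; the index pattern and field names below are assumptions, not the final schema):

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    // Sum the per-process RSS documents shipped by metricbeat into one figure per Kibana instance.
    async function totalRssPerInstance() {
      return client.search({
        index: '.monitoring-kibana-*', // assumed index pattern
        size: 0,
        aggs: {
          per_instance: {
            terms: { field: 'kibana_stats.kibana.uuid' }, // assumed field
            aggs: {
              total_rss: {
                sum: { field: 'kibana_stats.process.memory.resident_set_size_in_bytes' }, // assumed field
              },
            },
          },
        },
      });
    }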

@mshustov
Contributor

@pgayvallet any news on what the new data from kibana is expected to look like?

@matschaffer According to #104031, process data will be available under the processes field.
The old format:

process: ProcessInfo;

New format:

processes: ProcessInfo[];
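
i.e., roughly the following shape (the pids and the placeholder ProcessInfo type below are illustrative only):

    // Minimal placeholder for illustration; the real ProcessInfo carries the full ops metrics fields.
    type ProcessInfo = { pid: number };

    // Old format: a single object for the (only) Kibana process.
    // New format: one entry per process, i.e. the coordinator plus each worker.
    const processes: ProcessInfo[] = [
      { pid: 52646 }, // coordinator
      { pid: 52647 }, // worker-1
      { pid: 52648 }, // worker-2
    ];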

@lukeelmers lukeelmers added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Oct 29, 2021
@lukeelmers lukeelmers removed the required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. label Oct 29, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Nov 1, 2021
@matschaffer
Contributor

I'm proposing closing this in favor of #119560 - will leave final call to core team (cc @Bamieh who should be bringing it back to core team meetings).

@matschaffer
Contributor

Got a ping from @Bamieh that he and @lukeelmers are good with the proposal in #119560 - closing this as proposed.
