
[Breaking][core.metrics] Remove process field from ops metrics & /stats API #104124

Closed
lukeelmers opened this issue Jul 1, 2021 · 20 comments
Labels: Breaking Change, impact:low, loe:medium, Team:Core, won't fix

Comments

@lukeelmers
Member

Part of #68626
Blocked by #104031

Summary

Once #104031 is complete and the existing process field has been deprecated from our ops metrics & /stats APIs, we'll need to remove it. This is necessary to support a multi-process Kibana as outlined in the clustering RFC: #94057

Alternative

As outlined in the RFC, there is a way that we could keep the existing field in place and avoid breaking it in 8.0, but it has significant downsides.

The alternative would essentially be aggregating each of the metrics based on data reported from each worker:

{
  // ...
  "process": {
    "memory": {
      "heap": {
        "total_bytes": 533581824, // sum of coordinator + workers
        "used_bytes": 296297424, // sum of coordinator + workers
        "size_limit": 4345298944 // sum of coordinator + workers
      },
      "resident_set_size_bytes": 563625984 // sum of coordinator + workers
    },
    "pid": 52646, // pid of the coordinator
    "event_loop_delay": 0.22967800498008728, // max of coordinator + workers
    "uptime_ms": 1706021.930404 // uptime of the coordinator
  },
  // ...
}
  • size_limit in particular could be confusing
  • summing available/used heap & node RSS is straightforward
  • taking the max of event_loop_delay mostly makes sense, as we are generally only interested in that number when it is high anyway
  • pid and uptime_in_millis from the coordinator make sense, especially as long as we are killing all workers any time one of them dies. However, if in the future we respawn workers that die, this could be misleading.

Overall, this approach means the data are less accurate and potentially misleading, and therefore less helpful for users. That's why making the breaking change in 8.0 is preferred.
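
For illustration only, a minimal sketch of what that aggregation could look like (the OpsProcessMetrics shape and aggregateProcessMetrics helper below are hypothetical names, not part of any actual API):

    interface OpsProcessMetrics {
      pid: number;
      memory: {
        heap: { total_bytes: number; used_bytes: number; size_limit: number };
        resident_set_size_bytes: number;
      };
      event_loop_delay: number;
      uptime_ms: number;
    }

    // Collapse coordinator + worker metrics into a single `process` block, per the rules above.
    function aggregateProcessMetrics(
      coordinator: OpsProcessMetrics,
      workers: OpsProcessMetrics[]
    ): OpsProcessMetrics {
      const all = [coordinator, ...workers];
      const sum = (select: (p: OpsProcessMetrics) => number) =>
        all.reduce((acc, p) => acc + select(p), 0);

      return {
        pid: coordinator.pid, // pid of the coordinator
        memory: {
          heap: {
            total_bytes: sum((p) => p.memory.heap.total_bytes),
            used_bytes: sum((p) => p.memory.heap.used_bytes),
            size_limit: sum((p) => p.memory.heap.size_limit),
          },
          resident_set_size_bytes: sum((p) => p.memory.resident_set_size_bytes),
        },
        event_loop_delay: Math.max(...all.map((p) => p.event_loop_delay)), // max across processes
        uptime_ms: coordinator.uptime_ms, // uptime of the coordinator
      };
    }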

@lukeelmers lukeelmers added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Breaking Change required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. labels Jul 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh
Member

Bamieh commented Jul 6, 2021

The current event_loop_delay is not really useful since the delay is captured only at collection time rather than continuously over the life of the process.

Our new event loop delay collector captures this metric more accurately in a histogram throughout the duration of the process. We might want to rely on that approach or move it around if needed. Related: #98673
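
For reference, Node's built-in perf_hooks.monitorEventLoopDelay() is the kind of histogram-based collector described here; a minimal sketch of the idea (illustrative only, not the actual collector from #98673):

    import { monitorEventLoopDelay } from 'perf_hooks';

    // Continuously sample event loop delay into a histogram (values are in nanoseconds).
    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    // At collection time, report summary statistics instead of a single point-in-time value.
    setInterval(() => {
      console.log({
        mean_ms: histogram.mean / 1e6,
        max_ms: histogram.max / 1e6,
        p99_ms: histogram.percentile(99) / 1e6,
      });
      histogram.reset();
    }, 30_000);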

@lukeelmers
Member Author

Our new event loop delay collector captures this metric more accurately in a histogram throughout the duration of the process. We might want to rely on that approach or move it around if needed

👍 Added a note on this to the issue where we are discussing the new format (#104031)

@afharo
Member

afharo commented Aug 16, 2021

We might need to sync with Metricbeat and Monitoring when applying these changes.

@afharo
Member

afharo commented Aug 24, 2021

I created this issue for the Metricbeat team to track them: elastic/beats#27569

@afharo
Member

afharo commented Aug 24, 2021

Also adding that this may be a breaking change to GET /api/status since it also exposes the ops metrics.

@matschaffer
Contributor

It looks like

field: 'process.memory.resident_set_size_in_bytes',
references the process object to retrieve RSS information.

Do we have an alternate field to get information about how much memory kibana is using?

@pgayvallet
Contributor

Do we have an alternate field to get information about how much memory kibana is using?

The alternative would essentially be aggregating each of the metrics based on data reported from each worker

As said in the issue's description, aggregating some of these values doesn't really make sense. However, for monitoring purposes, maybe we should introduce aggregated values for a subset of the memory information? The RSS is the most obvious example, as we can imagine monitoring processes wanting to know the total amount of memory Kibana is consuming. As this feature is a breaking change anyway, we should probably use a new section for that, such as aggregatedData, or something?
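
For illustration, a rough sketch of what such a section could look like (the aggregated name and field set below are purely hypothetical):

    // Purely hypothetical: per-process entries plus a small aggregated section,
    // limited to the values that are meaningful to sum across processes.
    interface AggregatedOpsMetrics {
      resident_set_size_bytes: number; // coordinator + all workers
      heap_used_bytes: number; // coordinator + all workers
    }

    interface OpsMetricsResponse {
      processes: Array<{ pid: number; memory: { resident_set_size_bytes: number } }>; // per-process data (#104031)
      aggregated: AggregatedOpsMetrics;
    }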

@matschaffer
Contributor

I think in the end it'll be important for kibana developers and administrators to know how the app is using memory.

Since this is probably common for any node.js process, I wonder what @elastic/apm-agent-node-js team has explored in that space.

Having a common approach between both APM & metricbeat could be beneficial down the road.

@trentm
Member

trentm commented Sep 7, 2021

I wonder what @elastic/apm-agent-node-js team has explored in that space.

Nothing much at all, unfortunately.

The Node.js APM agent does no instrumentation specific to the cluster module or aggregation of metrics across separate cluster processes. Assuming the APM agent is started outside of cluster.is{Master|Primary|Worker} logic, there will be an APM agent in each process operating independently. Every metricsInterval (default 30s) the APM agent will report metrics like the following for each process:

    {
        "metadata": {
...
            "process": {
                "pid": 5606,
                "ppid": 5605,
...
    }
    {
        "metricset": {
            "samples": {
                "system.cpu.total.norm.pct": { "value": 0.022109233365979697 },
                "system.memory.total": { "value": 68719476736 },
                "system.memory.actual.free": { "value": 42007408640 },
                "system.process.cpu.total.norm.pct": { "value": 0.0005280573627318066 },
                "system.process.cpu.system.norm.pct": { "value": 0.0001569328440209994 },
                "system.process.cpu.user.norm.pct": { "value": 0.0003711245187108072 },
                "system.process.memory.rss.bytes": { "value": 36159488 },
                "nodejs.handles.active": { "value": 4 },
                "nodejs.requests.active": { "value": 0 },
                "nodejs.eventloop.delay.avg.ms": { "value": 1.6510719999999992 },
                "nodejs.memory.heap.allocated.bytes": { "value": 11169792 },
                "nodejs.memory.heap.used.bytes": { "value": 8776264 },
                "nodejs.memory.external.bytes": { "value": 1515422 },
                "nodejs.memory.arrayBuffers.bytes": { "value": 17655 }
            },
            "timestamp": 1631033459971000,
            "tags": {   // aka "labels" once imported to Elasticsearch
                "hostname": "pink.local",
                "env": "development"
            }
        }
    }

It might be useful to add a label (via globalLabels APM config) to metrics for whether the process is a primary (coordinator) or a worker.
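
For example, something along these lines (a sketch only; processRole is a made-up label name, and Node 16+ exposes cluster.isPrimary while older versions use cluster.isMaster):

    import cluster from 'cluster';
    import apm from 'elastic-apm-node';

    // Start the agent in every process and tag its metrics with the process's role,
    // so coordinator and worker metrics can be told apart in the APM UI.
    // (In practice the agent should be started before other application imports.)
    apm.start({
      serviceName: 'kibana',
      globalLabels: {
        processRole: cluster.isPrimary ? 'coordinator' : 'worker',
      },
    });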

@matschaffer
Contributor

there will be an APM agent in each process operating independently.

Does the APM UI then represent each cluster process as its own host? And if so, how do they get named? I'd be curious to see it in the UI if possible.

On the kibana side, do the workers have any sort of designation, like maybe web, worker, etc.? Or do we just get N generic kibana processes any time someone starts kibana?

Is there a way we can run kibana in cluster mode today to see what it looks like in the OS process table maybe?

Trying to wrap my head around how we can show a kibana admin real data about memory consumption without it being confusing to look at.

@pgayvallet
Contributor

On the kibana side do the workers have any sort of designation like maybe: web, worker, etc?

We're not yet planning on having specialized workers. We'll have one coordinator and N workers, with each worker having a unique identifier. E.g. in a 2-worker instance, we'll have coordinator, worker-1 and worker-2.

@matschaffer
Contributor

And (stepping back from APIs) how would an operator know if the memory usage of the coordinator and workers is reasonable?

Even just a bash example would be fine at this stage I think.

@pgayvallet
Contributor

And (stepping back from APIs) how would an operator know if the memory usage of the coordinator and workers is reasonable?

As an operator, I'm not sure the memory usage of specific processes is really the first thing to monitor, compared to the aggregated RSS from all processes. Which is why I was asking whether it may make sense to expose some of the aggregated values (in #104124 (comment)).

@matschaffer
Contributor

aggregated RSS from all processes?

For sure. I could see a case for both. On one side you want to make sure the whole package isn't using more RAM than you're willing to give it. On the other side we probably want to be able to get performance information for each worker (especially kibana devs trying to help customers root out performance issues).

@matschaffer
Contributor

@pgayvallet any news on what the new data from kibana is expected to look like?

It'd be great to have a way to exercise the new APIs so we can start working on metricbeat/UI changes required.

@matschaffer
Contributor

One thought: even if the aggregated API doesn't seem feasible, if we have a way for metricbeat to query each subprocess and send that to a monitoring cluster, we could aggregate at query time in the UI.
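
For instance, a rough sketch of query-time aggregation with the Elasticsearch JS client (v8 style; the index pattern and field names below are assumptions, not the final schema):

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    // Sum the per-process RSS documents shipped by metricbeat into one figure per Kibana instance.
    async function totalRssPerInstance() {
      return client.search({
        index: '.monitoring-kibana-*', // assumed index pattern
        size: 0,
        aggs: {
          per_instance: {
            terms: { field: 'kibana_stats.kibana.uuid' }, // assumed field
            aggs: {
              total_rss: {
                sum: { field: 'kibana_stats.process.memory.resident_set_size_in_bytes' }, // assumed field
              },
            },
          },
        },
      });
    }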

@mshustov
Contributor

@pgayvallet any news on what the new data from kibana is expected to look like?

@matschaffer According to #104031, process data will be available under the processes field.
The old format:

process: ProcessInfo;

New format:

processes: ProcessInfo[];
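
i.e., roughly the following shape (the pids and the placeholder ProcessInfo type below are illustrative only):

    // Minimal placeholder for illustration; the real ProcessInfo carries the full ops metrics fields.
    type ProcessInfo = { pid: number };

    // Old format: a single object for the (only) Kibana process.
    // New format: one entry per process, i.e. the coordinator plus each worker.
    const processes: ProcessInfo[] = [
      { pid: 52646 }, // coordinator
      { pid: 52647 }, // worker-1
      { pid: 52648 }, // worker-2
    ];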

@lukeelmers lukeelmers added impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Oct 29, 2021
@lukeelmers lukeelmers removed the required-for-8.0 This work is required to be done before 8.0 lands, bc it relates to a breaking change or similar. label Oct 29, 2021
@exalate-issue-sync exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Nov 1, 2021
@matschaffer
Contributor

I'm proposing closing this in favor of #119560 - will leave final call to core team (cc @Bamieh who should be bringing it back to core team meetings).

@matschaffer
Contributor

Got a ping from @Bamieh that he and @lukeelmers are good with the proposal in #119560 - closing this as proposed.
