
AggregatorRegistry assumes all workers will have metrics setup #181

Open · orestis opened this issue Mar 26, 2018 · 5 comments

orestis commented Mar 26, 2018

Hello,

thanks for prom-client, it's a huge time saver for us!

I have run into an issue with the cluster aggregator support. It seems that unless all the workers of a Node cluster are set up with prom-client, the aggregator won't work.

The issue seems to be in the clusterMetrics implementation:

```js
const nWorkers = Object.keys(cluster.workers).length;
function done(err, result) {
  // Don't resolve/reject the promise if a callback is provided
  if (typeof callback === 'function') {
    callback(err, result);
  } else {
    if (err) reject(err);
    else resolve(result);
  }
}
if (nWorkers === 0) {
  return process.nextTick(() => done(null, ''));
}
const request = {
  responses: [],
  pending: nWorkers,
  done,
  errorTimeout: setTimeout(() => {
    request.failed = true;
    const err = new Error('Operation timed out.');
    request.done(err);
  }, 5000),
  failed: false
};
requests.set(requestId, request);
const message = {
  type: GET_METRICS_REQ,
  requestId
};
// Every worker is messaged, whether or not it ever registered a listener.
for (const id in cluster.workers) cluster.workers[id].send(message);
```

We have workers in our node cluster that do not import prom-client for various reasons, hence I need a way to instruct prom-client to not bother contacting them.
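For concreteness, a minimal reproduction might look like this (a sketch; the env-var gating is just for illustration):

```js
const cluster = require('cluster');

if (cluster.isMaster) {
  const { AggregatorRegistry } = require('prom-client');
  const registry = new AggregatorRegistry();

  // Two workers: only the first one loads prom-client.
  cluster.fork({ WITH_METRICS: '1' });
  cluster.fork({ WITH_METRICS: '0' });

  setTimeout(() => {
    registry
      .clusterMetrics()
      .then(metrics => console.log(metrics))
      // Rejects with 'Operation timed out.' because the second
      // worker never answers the GET_METRICS_REQ message.
      .catch(err => console.error(err.message));
  }, 1000);
} else {
  if (process.env.WITH_METRICS === '1') {
    // Requiring prom-client implicitly sets up the worker-side listener.
    require('prom-client');
  }
  setInterval(() => {}, 1000); // keep the worker alive either way
}
```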

I'm happy to provide a PR for this, but wanted to present possible solutions first:

  1. Instead of rejecting the promise when a timeout happens, silently ignore the timeout.
  2. Same as (1), but report non-responsive workers as a separate metric for monitoring (a rough sketch of this follows the list).
  3. Instead of sending the GET_METRICS_REQ to all the workers, maintain a list of workers that have setup the listeners and send only to those.
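A rough sketch of what (1) plus (2) could look like as a change to the timeout handler above. Everything here is hypothetical: `nonResponsiveWorkers` is a gauge the library would have to register itself, and `AggregatorRegistry.aggregate()` mirrors the merge the normal completion path performs:

```js
const client = require('prom-client');

// Hypothetical gauge the library would register itself.
const nonResponsiveWorkers = new client.Gauge({
  name: 'nodejs_cluster_nonresponsive_workers',
  help: 'Workers that did not answer the last metrics request'
});

// Drop-in replacement for the timeout handler above.
function onTimeout(request) {
  // (2): expose how many workers never answered.
  nonResponsiveWorkers.set(request.pending);
  // (1): aggregate what we do have instead of rejecting.
  const registry = client.AggregatorRegistry.aggregate(request.responses);
  request.done(null, registry.metrics());
}
```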

Thoughts?

SimenB (Collaborator) commented Mar 26, 2018

@zbjornson thoughts?

zbjornson (Collaborator) commented

Interesting situation...

Re: your possible solutions:

  1. This would silently create wrong metric values where you were expecting workers to respond. E.g. if you were summing CPU or memory usage, you'd see artificial dips that are actually because a worker didn't report.

  2. Seems like a hassle to monitor this auxiliary metric...

  3. Possibly, depending on how it's done. If you did this so that on boot the workers message the master to indicate that they opt in to metrics, users of this lib would have to ensure that they construct the AggregatorRegistry before forking workers. The opposite would be a discovery phase (defaulting to, say, one minute?) during which the master polls workers to see if they've set up listeners; if a worker never responds during the discovery phase, don't poll it in the future. (The opt-in variant is sketched below.)
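A minimal sketch of that opt-in variant (all names here are hypothetical; this is not current prom-client behaviour):

```js
const cluster = require('cluster');

if (cluster.isWorker) {
  // On boot, a metrics-enabled worker announces itself to the master.
  process.send({ type: 'prom-client:listening' });
} else {
  const metricsWorkers = new Set();

  cluster.on('message', (worker, message) => {
    if (message && message.type === 'prom-client:listening') {
      metricsWorkers.add(worker.id);
    }
  });
  cluster.on('exit', worker => metricsWorkers.delete(worker.id));

  // clusterMetrics() would then fan out only to metricsWorkers
  // instead of every entry in cluster.workers.
}
```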

I'll add another option:

  4. Like (1), but opt-in: something like registry.clusterMetrics({allowPartialAggregation: true}).
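Usage of that hypothetical option might look like this (a sketch only; the option does not exist today):

```js
const { AggregatorRegistry } = require('prom-client');
const registry = new AggregatorRegistry();

registry
  // With the flag set, a timed-out request would resolve with the
  // responses collected so far instead of rejecting.
  .clusterMetrics({ allowPartialAggregation: true })
  .then(metrics => console.log(metrics))
  .catch(err => console.error(err)); // still rejects on real errors
```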


orestis commented Mar 26, 2018

I've done a quick implementation of (3) here: #182 -- it's not ready to merge yet, but should foster discussion.

I didn't think of the ordering issues (create registry then fork vs opposite). I'll think about it.


orestis commented Mar 27, 2018

I'm trying to think how this can be done in a way that gives the consumer of the library control over the initialisation process, so I'm thinking aloud a bit here:

  • The current implementation, which contacts all workers, has no ordering issue, but suffers when workers are non-responsive by design. It also depends on cluster.js being required in all the workers; this is currently done by the top-level index.js, so a plain require('prom-client') takes care of it.

  • The implementation in #182 ("ensure that only registered cluster workers are asked to report metrics") expects the AggregatorRegistry to be created in the master before forking, and also that workers require prom-client.

  • Since workers interested in metrics must require prom-client anyway, I think it's safe to keep the worker-side cluster setup implicit, as it is now. This way worker code doesn't have to worry about whether it's running in a cluster or not.

  • The master process of a cluster must explicitly opt in to using the AggregatorRegistry and exposing the metrics via a web endpoint, so the master code is already aware of the cluster in play. In our codebase we try to abstract this away, but we do have to check cluster.isMaster and switch to a different behaviour.

  • The simple approach is to require users of the library to create the AggregatorRegistry before forking (as mentioned already; there's a sketch of this after the list). However, I understand that this will break existing code that expected things to be automatic.

  • Having a discovery phase etc. might be a good long-term solution that also deals with workers dying. I've already added some code that tries to do the right thing, but I haven't tested it thoroughly.

  • Perhaps, in the interests of backwards compatibility, this new functionality could be put behind an option or a different class altogether, e.g. CoordinatedAggregatorRegistry. Users of the current AggregatorRegistry would still get the previous behaviour.

  • I will test a bit more against our codebase and report back.
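To make the ordering constraint concrete, a master under the #182 approach might look roughly like this (a sketch of the intended usage, not the released API):

```js
const cluster = require('cluster');
const http = require('http');
const { AggregatorRegistry } = require('prom-client');

if (cluster.isMaster) {
  // Create the registry *before* forking so it can observe every fork
  // and track which workers opt in to metrics.
  const registry = new AggregatorRegistry();

  for (let i = 0; i < 4; i++) cluster.fork();

  http
    .createServer((req, res) => {
      registry
        .clusterMetrics()
        .then(metrics => {
          res.setHeader('Content-Type', registry.contentType);
          res.end(metrics);
        })
        .catch(err => {
          res.statusCode = 500;
          res.end(err.message);
        });
    })
    .listen(3001);
} else {
  // Worker: requiring prom-client implicitly wires up cluster support.
  require('prom-client');
}
```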


orestis commented Mar 27, 2018

I've updated the PR; it's now deployed in our staging environment and seems to be ticking along nicely. If the general approach is accepted, I can write some tests covering this new option.
