[Telemetry] Instrument APM around snapshot telemetry generation #135922

Open
Tracked by #119466
afharo opened this issue Jul 7, 2022 · 4 comments
Labels
Feature:Telemetry performance Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@afharo
Member

afharo commented Jul 7, 2022

We want to minimize the impact that telemetry collection has on production clusters. To gain more visibility into that impact, let's instrument APM transactions and spans around each collector run, so we can measure and identify the most costly actions performed by the usage collectors.

Relevant piece of code:

public bulkFetch = async (
  esClient: ElasticsearchClient,
  soClient: SavedObjectsClientContract,
  collectors: Map<string, AnyCollector> = this.collectors
) => {
  this.logger.debug(`Getting ready collectors`);
  const getMarks = createPerformanceObsHook();
  const { readyCollectors, nonReadyCollectorTypes, timedOutCollectorsTypes } =
    await this.getReadyCollectors(collectors);
  // freeze object to prevent collectors from mutating it.
  const context = Object.freeze({ esClient, soClient });
  const fetchExecutions = await Promise.all(
    readyCollectors.map(async (collector) => {
      const wrappedPromise = perfTimerify(
        `fetch_${collector.type}`,
        async () => await this.fetchCollector(collector, context)
      );
      return await wrappedPromise();
    })
  );
  const durationMarks = getMarks();

@afharo afharo added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc performance Feature:Telemetry labels Jul 7, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@afharo
Member Author

afharo commented Jul 7, 2022

@lizozom, given your experience with APM instrumentation, it'd be great if you could advise us on the best way to instrument these: a single usage_collection transaction with one span per collector, or multiple transactions?

Bear in mind that collectors run in parallel. Does that affect the spans/transactions (i.e., will they close each other)?
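
For illustration, here is a minimal sketch of the first option: one usage_collection transaction with one span per collector, all collectors running in parallel. Everything in it is an assumption made for the example. The Tracer, Transaction, Span and Collector interfaces, the RecordingTracer class and bulkFetchWithSpans are hypothetical stand-ins, only loosely modelled on the startTransaction/startSpan style of an APM agent; they are not Kibana's or the agent's actual API. Each span is ended through its own handle, so in this sketch parallel collectors do not close each other's spans. Whether the real agent behaves the same way is exactly the open question above.

```typescript
// Hypothetical tracer shapes, for illustration only.
interface Span {
  end(): void;
}
interface Transaction {
  startSpan(name: string, type: string): Span | null;
  end(): void;
}
interface Tracer {
  startTransaction(name: string, type: string): Transaction | null;
}
interface Collector {
  type: string;
  fetch(): Promise<unknown>;
}

// In-memory tracer that only records the order of events, so the sketch is runnable.
class RecordingTracer implements Tracer {
  public readonly events: string[] = [];

  startTransaction(name: string): Transaction {
    const events = this.events;
    events.push(`start transaction ${name}`);
    return {
      startSpan: (spanName: string): Span => {
        events.push(`start span ${spanName}`);
        return { end: () => void events.push(`end span ${spanName}`) };
      },
      end: () => void events.push(`end transaction ${name}`),
    };
  }
}

// One transaction for the whole collection, one span per collector.
async function bulkFetchWithSpans(tracer: Tracer, collectors: Collector[]) {
  const transaction = tracer.startTransaction('usage_collection', 'telemetry');
  try {
    return await Promise.all(
      collectors.map(async (collector) => {
        // Each collector holds its own span handle, created from the shared transaction.
        const span = transaction?.startSpan(`fetch_${collector.type}`, 'collector') ?? null;
        try {
          return { type: collector.type, result: await collector.fetch() };
        } finally {
          // End the span even if the collector throws.
          span?.end();
        }
      })
    );
  } finally {
    // The transaction ends only after every collector has settled or one has failed.
    transaction?.end();
  }
}
```

With two collectors where the second resolves first, the recorded events show both spans starting before either ends, and the transaction ending last.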

@rudolf
Contributor

rudolf commented Jul 7, 2022

APM can help us see why the wall time is slow (e.g. because there are several requests running serially), but it cannot answer the question "why is ... making Kibana slow". When it comes to telemetry, the wall time is not the problem; the problem is that it sometimes blocks the CPU and causes slow event loops.
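
To make the wall-time versus event-loop distinction concrete, here is a minimal sketch that measures both around a single task, assuming Node's built-in perf_hooks.monitorEventLoopDelay. The withEventLoopDelay helper and the LoopDelayReport shape are hypothetical names for this example, not Kibana code. The histogram is process-wide, so with collectors running in parallel it can show that the loop was blocked, but not which collector blocked it. The reported delays are approximate.

```typescript
import { monitorEventLoopDelay, performance } from 'perf_hooks';

// Hypothetical report shape for this sketch.
interface LoopDelayReport<T> {
  result: T;
  wallMs: number; // elapsed wall-clock time of the task
  maxLoopDelayMs: number; // largest event-loop delay sampled while the task ran
}

// Runs a task while sampling event-loop delay with Node's built-in histogram.
async function withEventLoopDelay<T>(task: () => Promise<T>): Promise<LoopDelayReport<T>> {
  // Sample roughly every 10 ms; the histogram reports values in nanoseconds.
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  const startedAt = performance.now();
  histogram.enable();
  try {
    const result = await task();
    return {
      result,
      wallMs: performance.now() - startedAt,
      maxLoopDelayMs: histogram.max / 1e6,
    };
  } finally {
    histogram.disable();
  }
}
```

A task that mostly awaits I/O would show a long wallMs with a small maxLoopDelayMs, whereas a task that does heavy synchronous work would push maxLoopDelayMs up as well. The latter is the case described above.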

@afharo
Member Author

afharo commented Jul 8, 2022

> When it comes to telemetry the wall time is not the problem, the problem is that it sometimes blocks the CPU and causes slow event loops.

I'd say it's both: we received claims that it could also affect ES' performance (#93770), even though we've demonstrated most of the time it's just highlighting another underlying problem.

Re event loop delays, it looks like APM also monitors that (elastic/apm-agent-nodejs#1053).
