Prometheus metrics #504

thinkharderdev · 2022-11-08T15:08:51Z

Which issue does this PR close?

Closes #427

Rationale for this change

We should expose metrics from the scheduler in a standard format. This PR adds an initial integration with Prometheus to expose some basic scheduler metrics.

What changes are included in this PR?

Add a new trait SchedulerMetricsCollector, which is plugged into the core event loop to capture metrics.
Add some additional metadata to the job events so we can track the durations of the different parts of the query lifecycle (queued, planning, execution) separately.
Add an implementation of SchedulerMetricsCollector for Prometheus.
Add an API endpoint /api/metrics, which exposes the metrics.
Add some baseline metrics.
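As a rough illustration of how a collector can be plugged into an event loop, here is a minimal std-only sketch. The method names, their arguments, the InMemoryCollector type, and the run_job helper are all assumptions made for this example; they are not the trait or the code added in this PR.

```rust
use std::sync::Mutex;

/// Hypothetical collector interface. Method names and arguments are
/// illustrative assumptions, not the trait defined in this PR.
pub trait SchedulerMetricsCollector: Send + Sync {
    /// Called once planning has finished and the job is ready to execute.
    fn record_submitted(&self, job_id: &str, queued_at_ms: u64, submitted_at_ms: u64);
    /// Called when the job finishes successfully.
    fn record_completed(&self, job_id: &str, queued_at_ms: u64, completed_at_ms: u64);
    /// Called when the job fails.
    fn record_failed(&self, job_id: &str);
}

/// Test-friendly collector that stores a description of every event in memory.
#[derive(Default)]
pub struct InMemoryCollector {
    events: Mutex<Vec<String>>,
}

impl InMemoryCollector {
    /// Returns a copy of all events recorded so far.
    pub fn events(&self) -> Vec<String> {
        self.events.lock().unwrap().clone()
    }

    fn push(&self, event: String) {
        self.events.lock().unwrap().push(event);
    }
}

impl SchedulerMetricsCollector for InMemoryCollector {
    fn record_submitted(&self, job_id: &str, queued_at_ms: u64, submitted_at_ms: u64) {
        let elapsed = submitted_at_ms.saturating_sub(queued_at_ms);
        self.push(format!("{} planned in {} ms", job_id, elapsed));
    }

    fn record_completed(&self, job_id: &str, queued_at_ms: u64, completed_at_ms: u64) {
        let elapsed = completed_at_ms.saturating_sub(queued_at_ms);
        self.push(format!("{} completed in {} ms", job_id, elapsed));
    }

    fn record_failed(&self, job_id: &str) {
        self.push(format!("{} failed", job_id));
    }
}

/// Stand-in for the event loop: it only knows about the trait object,
/// so any collector implementation can be swapped in.
pub fn run_job(
    collector: &dyn SchedulerMetricsCollector,
    job_id: &str,
    queued_at_ms: u64,
    submitted_at_ms: u64,
    finished_at_ms: Option<u64>,
) {
    collector.record_submitted(job_id, queued_at_ms, submitted_at_ms);
    match finished_at_ms {
        Some(t) => collector.record_completed(job_id, queued_at_ms, t),
        None => collector.record_failed(job_id),
    }
}
```

Because the loop depends only on the trait, a Prometheus-backed collector and an in-memory test collector would be interchangeable in a design like this.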

In addition, I did some refactoring to make the core event loop more testable. Currently you can't really test it rigorously at all, which is not great since it is a rather critical piece of code. I added a new trait TaskLauncher, which controls how the TaskManager actually launches tasks, along with various crate-private methods to plug in a custom launcher. I don't imagine this will ever be exposed through a public API, but it is very useful for testing.

Using the TaskLauncher, I added some additional utilities to write scheduler tests more succinctly.
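To make the idea concrete, the following is a small std-only sketch of a pluggable launcher with a recording implementation for tests. The trait signature, RecordingLauncher, and the simplified TaskManager below are hypothetical; the real types in the scheduler will differ.

```rust
use std::sync::Mutex;

/// Hypothetical launcher abstraction; the signature is an assumption for
/// illustration and is not the trait introduced in this PR.
pub trait TaskLauncher: Send + Sync {
    fn launch(&self, executor_id: &str, task_ids: &[u32]) -> Result<(), String>;
}

/// Test launcher that records what would have been launched instead of
/// talking to a real executor.
#[derive(Default)]
pub struct RecordingLauncher {
    launched: Mutex<Vec<(String, Vec<u32>)>>,
}

impl RecordingLauncher {
    /// Returns a copy of every (executor, tasks) pair seen so far.
    pub fn launched(&self) -> Vec<(String, Vec<u32>)> {
        self.launched.lock().unwrap().clone()
    }
}

impl TaskLauncher for RecordingLauncher {
    fn launch(&self, executor_id: &str, task_ids: &[u32]) -> Result<(), String> {
        self.launched
            .lock()
            .unwrap()
            .push((executor_id.to_string(), task_ids.to_vec()));
        Ok(())
    }
}

/// Heavily simplified stand-in for a task manager that delegates launching.
pub struct TaskManager<L: TaskLauncher> {
    launcher: L,
}

impl<L: TaskLauncher> TaskManager<L> {
    /// Crate-private constructor so tests can inject a custom launcher.
    pub(crate) fn with_launcher(launcher: L) -> Self {
        Self { launcher }
    }

    pub fn launcher(&self) -> &L {
        &self.launcher
    }

    /// Launches every assignment and returns the number of tasks launched.
    pub fn launch_pending(&self, assignments: &[(&str, Vec<u32>)]) -> Result<usize, String> {
        let mut count = 0;
        for (executor_id, task_ids) in assignments {
            self.launcher.launch(executor_id, task_ids)?;
            count += task_ids.len();
        }
        Ok(count)
    }
}
```

A test can then assert on what the recording launcher saw, without starting any executors.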

Are there any user-facing changes?

Users can plug in their own metrics collector if they want.

No


thinkharderdev · 2022-11-08T15:10:44Z

ballista/scheduler/src/metrics/prometheus.rs

+        let execution_time = register_histogram_with_registry!(
+            "query_time_seconds",
+            "Histogram of query execution time in seconds",
+            vec![0.5_f64, 1_f64, 5_f64, 30_f64, 60_f64],
+            registry
+        )
+        .map_err(|e| {
+            BallistaError::Internal(format!("Error registering metric: {:?}", e))
+        })?;
+
+        let planning_time = register_histogram_with_registry!(
+            "planning_time_ms",
+            "Histogram of query planning time in milliseconds",
+            vec![1.0_f64, 5.0_f64, 25.0_f64, 100.0_f64, 500.0_f64],
+            registry
+        )
+        .map_err(|e| {
+            BallistaError::Internal(format!("Error registering metric: {:?}", e))
+        })?;


These histograms are hard to generalize since use cases could vary wildly. We can either:

Make these configurable

Try and choose sensible defaults and if someone needs to customize then they can implement a custom collector
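A minimal sketch of how the two options could be combined, with default buckets that a caller can override, is shown below. It uses only the standard library and a hand-rolled BucketedHistogram type invented for this example rather than the prometheus crate; the default bounds are copied from the query_time_seconds histogram in the diff above, and the cumulative counting follows Prometheus's "less than or equal" bucket semantics.

```rust
/// Hypothetical histogram with configurable bucket upper bounds.
/// This is a std-only illustration, not the prometheus crate's Histogram.
pub struct BucketedHistogram {
    upper_bounds: Vec<f64>,
    /// counts[i] is the number of observations <= upper_bounds[i].
    counts: Vec<u64>,
    total: u64,
    sum: f64,
}

impl BucketedHistogram {
    /// Builds a histogram from caller-supplied bounds (sorted and deduplicated).
    pub fn new(mut upper_bounds: Vec<f64>) -> Self {
        upper_bounds.sort_by(|a, b| a.partial_cmp(b).expect("bucket bounds must not be NaN"));
        upper_bounds.dedup();
        let counts = vec![0; upper_bounds.len()];
        Self { upper_bounds, counts, total: 0, sum: 0.0 }
    }

    /// Default bounds, taken from the query time histogram in this PR.
    pub fn with_default_query_buckets() -> Self {
        Self::new(vec![0.5, 1.0, 5.0, 30.0, 60.0])
    }

    /// Records one observation in every bucket whose bound covers it.
    pub fn observe(&mut self, value: f64) {
        for (bound, count) in self.upper_bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum += value;
    }

    pub fn cumulative_counts(&self) -> Vec<u64> {
        self.counts.clone()
    }

    /// Total observations, including those above the largest bound.
    pub fn total(&self) -> u64 {
        self.total
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }
}
```

With the default bounds, a query taking 120 seconds lands only in the implicit overflow bucket, which is one reason a workload with long-running queries might want to supply its own bounds.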

thinkharderdev · 2022-11-08T15:13:40Z

ballista/scheduler/src/state/task_manager.rs

-            let value = self.encode_execution_graph(graph.clone())?;
+
+            let value = encode_protobuf(&graph.status())?;


First bug uncovered while testing the event loop proper :) When a job failed, we saved the entire graph into FailedJobs here, but in all other scenarios we only save the status.

andygrove · 2022-11-08T18:15:58Z

This is awesome! Thanks @thinkharderdev. I will make time to try this out later this week.

thinkharderdev · 2022-11-08T19:02:30Z

I think this is ready for review. I had to tweak how we do the API routing slightly because the Prometheus exporter probably won't send an application/json Accept header, and I'm not sure it is guaranteed to send any particular Accept header, so I changed the routing to look at the path instead. I was able to verify that the metrics are exported.
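The difference between the two routing strategies can be sketched as follows. This is a std-only illustration: the Backend enum, both function names, and the /api prefix check are assumptions for this example, not the scheduler's actual routing code.

```rust
/// Where a request ends up. `Other` stands for whatever else the scheduler
/// serves on the same port; the variants are illustrative assumptions.
#[derive(Debug, PartialEq)]
pub enum Backend {
    RestApi,
    Other,
}

/// Header-based routing: only requests that explicitly accept JSON
/// reach the REST API.
pub fn route_by_accept(accept: Option<&str>) -> Backend {
    match accept {
        Some(value) if value.contains("application/json") => Backend::RestApi,
        _ => Backend::Other,
    }
}

/// Path-based routing: anything under /api reaches the REST API,
/// regardless of which headers the client sends.
pub fn route_by_path(path: &str) -> Backend {
    if path == "/api" || path.starts_with("/api/") {
        Backend::RestApi
    } else {
        Backend::Other
    }
}
```

Under header-based routing, a scrape request with a non-JSON Accept header, or with none at all, would miss the REST API, whereas path-based routing sends /api/metrics there in every case.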

andygrove · 2022-11-10T14:56:29Z

I filed #507 for writing documentation in the user guide for this feature.

andygrove · 2022-11-10T15:33:15Z

I pulled this branch and ran into a build issue:

  = note: /home/andy/miniconda3/envs/dask-sql/bin/../lib/gcc/x86_64-conda-linux-gnu/12.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: /home/andy/git/apache/arrow-ballista/target/release/deps/libprocfs-372835ba76b41348.rlib(procfs-372835ba76b41348.procfs.efebeec9-cgu.3.rcgu.o): in function `procfs::CpuTime::from_str':
          procfs.efebeec9-cgu.3:(.text._ZN6procfs7CpuTime8from_str17hf1b21a01a95d1a1cE+0x7b): undefined reference to `getauxval'

It is building OK in CI, though, so I am confused. Any ideas, @thinkharderdev?

thinkharderdev · 2022-11-10T16:34:36Z

Hmm... I haven't seen this locally, but it looks like something that happens when building the prometheus-rs crate. What OS are you using?

thinkharderdev · 2022-11-10T16:35:54Z

I filed #507 for writing documentation in the user guide for this feature.

Awesome. I'll have another PR shortly that will add the pending tasks queue tracking. I can add some initial documentation as part of that PR.

andygrove · 2022-11-10T22:00:33Z

Hmm... I haven't seen this locally, but it looks like something that happens when building the prometheus-rs crate. What OS are you using?

Ubuntu 20.04.4 LTS

thinkharderdev · 2022-11-11T16:31:10Z

@andygrove This PR is superseded by #511, which has some tweaks to make the pluggability more useful. If we're good with the other PR, we should just merge that one and I can close this one. I'll leave it open for the moment in case it's easier to review them separately.

thinkharderdev added 2 commits November 7, 2022 15:29

Add prometheus metrics for scheduler

5b950e8

Add SchedulerMetricsCollector, implementation for Prometheus and some…

81e677e

… refactoring for testability

thinkharderdev requested review from andygrove and yahoNanJing November 8, 2022 15:08


thinkharderdev added 4 commits November 8, 2022 10:25

Linting

1536353

apache heaeder

4b61e53

Linting again

a3c4061

Make sure collector is only initialized once

5ed961d

Do not rely on accept header for API routing

39f9552

thinkharderdev marked this pull request as ready for review November 8, 2022 19:00

andygrove mentioned this pull request Nov 10, 2022

Add user guide section on prometheus metrics #507

Closed

andygrove added this to the Ballista 0.10.0 milestone Nov 10, 2022

andygrove modified the milestones: Ballista 0.10.0, Ballista 0.11.0 Nov 10, 2022

thinkharderdev mentioned this pull request Nov 10, 2022

Add Prometheus metrics endpoint #511

Merged

andygrove closed this in #511 Nov 16, 2022
