Prometheus metrics #504
Conversation
let execution_time = register_histogram_with_registry!(
    "query_time_seconds",
    "Histogram of query execution time in seconds",
    vec![0.5_f64, 1_f64, 5_f64, 30_f64, 60_f64],
    registry
)
.map_err(|e| {
    BallistaError::Internal(format!("Error registering metric: {:?}", e))
})?;
let planning_time = register_histogram_with_registry!(
    "planning_time_ms",
    "Histogram of query planning time in milliseconds",
    vec![1.0_f64, 5.0_f64, 25.0_f64, 100.0_f64, 500.0_f64],
    registry
)
.map_err(|e| {
    BallistaError::Internal(format!("Error registering metric: {:?}", e))
})?;
These histogram buckets are hard to generalize since use cases could vary wildly. We can either:
- Make the buckets configurable
- Try to choose sensible defaults, and if someone needs to customize them they can implement a custom collector
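To make the trade-off concrete, here is a minimal, self-contained sketch of the "make these configurable" option. It does not use the prometheus crate; `SchedulerMetricsOpts` and the stand-in `Histogram` type are hypothetical names for illustration, with the default bucket bounds taken from the code above.

```rust
// Hypothetical sketch: configurable histogram buckets with sensible defaults.
// `SchedulerMetricsOpts` and its field names are illustrative, not part of the PR.

#[derive(Clone)]
pub struct SchedulerMetricsOpts {
    /// Bucket upper bounds (seconds) for the query execution time histogram
    pub execution_time_buckets: Vec<f64>,
}

impl Default for SchedulerMetricsOpts {
    fn default() -> Self {
        Self {
            // Defaults taken from the PR: 0.5s, 1s, 5s, 30s, 60s
            execution_time_buckets: vec![0.5, 1.0, 5.0, 30.0, 60.0],
        }
    }
}

/// A tiny stand-in histogram: counts observations per bucket, plus a
/// final +Inf bucket for values beyond the last bound.
pub struct Histogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
}

impl Histogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let n = bounds.len();
        Self { bounds, counts: vec![0; n + 1] }
    }

    pub fn observe(&mut self, value: f64) {
        // Find the first bucket whose upper bound contains the value;
        // fall through to the +Inf bucket if none does.
        let idx = self
            .bounds
            .iter()
            .position(|&b| value <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }
}

fn main() {
    // A caller could pass custom buckets here instead of the defaults.
    let opts = SchedulerMetricsOpts::default();
    let mut hist = Histogram::new(opts.execution_time_buckets.clone());
    hist.observe(0.3);   // lands in the 0.5s bucket
    hist.observe(7.2);   // lands in the 30s bucket
    hist.observe(120.0); // beyond the last bound: +Inf bucket
    println!("{:?}", hist.counts()); // [1, 0, 0, 1, 0, 1]
}
```

With the real prometheus crate the same idea amounts to threading user-supplied bucket vectors into the `register_histogram_with_registry!` calls instead of hard-coding them.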
let value = self.encode_execution_graph(graph.clone())?;

let value = encode_protobuf(&graph.status())?;
First bug uncovered while testing the event loop proper :) When a job failed, we would save the entire graph into FailedJobs here, but in all other scenarios we only save the status.
This is awesome! Thanks @thinkharderdev. I will make time to try this out later this week.
I think this is ready for review. I had to tweak how we do the API routing slightly because the prometheus exporter probably won't send an
I filed #507 for writing documentation in the user guide for this feature.
I pulled this branch and ran into a build issue:
It is building OK in CI though, so I am confused. Any ideas @thinkharderdev?
Hmm... I haven't seen this locally, but it looks like something in the build of the prometheus-rs crate. What OS are you using?
Awesome. I'll have another PR shortly that will add the pending tasks queue tracking. I can add some initial documentation as part of that PR.
Ubuntu 20.04.4 LTS |
@andygrove This PR is superseded by #511, which has some tweaks to make the plug-ability more useful, so if we're good with the other PR we should just merge that one and I can close this one. I'll leave it open for the moment in case it's easier to review them separately.
Which issue does this PR close?
Closes #427
Rationale for this change
We should expose metrics from the scheduler in a standard format. This PR adds an initial integration with Prometheus to expose some basic scheduler metrics.
What changes are included in this PR?
- Adds a new trait SchedulerMetricsCollector which is plugged into the core event loop to capture metrics
- Implements SchedulerMetricsCollector for Prometheus
- Adds a new API route /api/metrics which will expose metrics.

In addition, I did some refactoring to make the core event loop more testable. Currently you can't really test it rigorously at all, which is not great since it is a rather critical piece of code. I added a new trait TaskLauncher which controls how the TaskManager actually launches tasks, plus various crate-private methods to plug in a custom launcher. I don't imagine this will ever be exposed through a public API, but it is very useful for testing. Using the TaskLauncher, I added some additional utilities to write scheduler tests more succinctly.

Are there any user-facing changes?
Users can plug in their own metrics collector if they want.
No
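The pluggable-collector design described above can be sketched roughly as follows. Only the trait name SchedulerMetricsCollector comes from this PR; the method names, the in-memory implementation, and the simplified EventLoop type are hypothetical stand-ins for illustration, not the actual Ballista API.

```rust
// Hedged sketch of a pluggable scheduler metrics collector.
// The trait name comes from the PR; everything else is illustrative.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

pub trait SchedulerMetricsCollector: Send + Sync {
    fn record_job_submitted(&self, job_id: &str);
    fn record_job_completed(&self, job_id: &str, execution_time_secs: f64);
}

/// Simple in-memory collector, useful in tests. A prometheus-backed
/// implementation would register counters/histograms against a registry
/// and serve them from an endpoint such as /api/metrics instead.
#[derive(Default)]
pub struct InMemoryMetricsCollector {
    submitted: AtomicU64,
    completed: AtomicU64,
}

impl SchedulerMetricsCollector for InMemoryMetricsCollector {
    fn record_job_submitted(&self, _job_id: &str) {
        self.submitted.fetch_add(1, Ordering::Relaxed);
    }
    fn record_job_completed(&self, _job_id: &str, _execution_time_secs: f64) {
        self.completed.fetch_add(1, Ordering::Relaxed);
    }
}

/// A stand-in for the scheduler event loop holding an injected collector.
pub struct EventLoop {
    metrics: Arc<dyn SchedulerMetricsCollector>,
}

impl EventLoop {
    pub fn new(metrics: Arc<dyn SchedulerMetricsCollector>) -> Self {
        Self { metrics }
    }
    pub fn on_job_submitted(&self, job_id: &str) {
        self.metrics.record_job_submitted(job_id);
    }
    pub fn on_job_completed(&self, job_id: &str, secs: f64) {
        self.metrics.record_job_completed(job_id, secs);
    }
}

fn main() {
    let collector = Arc::new(InMemoryMetricsCollector::default());
    let event_loop = EventLoop::new(collector.clone());
    event_loop.on_job_submitted("job-1");
    event_loop.on_job_completed("job-1", 2.5);
    println!(
        "submitted={} completed={}",
        collector.submitted.load(Ordering::Relaxed),
        collector.completed.load(Ordering::Relaxed)
    );
}
```

Because the event loop only sees the trait object, tests can inject an in-memory collector like the one above, which is the same seam that makes a Prometheus implementation swappable.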