[WIP] Record stage-level metrics when running benchmarks #847
Conversation
Signed-off-by: Andy Grove <andygrove@nvidia.com>
…lso fix off-by-one error with nextId when generating DOT graphs Signed-off-by: Andy Grove <andygrove@nvidia.com>
if (i+1 == iterations) {
  spark.listenerManager.register(new BenchmarkListener(queryPlansWithMetrics))
  spark.sparkContext.addSparkListener(new BenchSparkListener(stageMetrics))
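For context, a stage-metrics listener of the kind registered above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the class and record names here are hypothetical, and the real `StageMetrics` case class in the PR carries more fields.

```scala
import scala.collection.mutable.ListBuffer
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Hypothetical record type for illustration; the PR's StageMetrics differs.
case class CollectedStageMetrics(stageId: Int, peakExecutionMemory: Long)

// Minimal sketch of a stage-level metrics listener: Spark invokes
// onStageCompleted once per completed stage, and we append one record
// per stage to the shared buffer.
class StageMetricsSparkListener(out: ListBuffer[CollectedStageMetrics])
    extends SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    out += CollectedStageMetrics(info.stageId, info.taskMetrics.peakExecutionMemory)
  }
}
```

It would be registered the same way as in the diff above, via `spark.sparkContext.addSparkListener(new StageMetricsSparkListener(buffer))`.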
We have BenchmarkListener and BenchSparkListener, and it's not clear from the names alone what either one does. Perhaps we should give them more meaningful names. I get that they might be used for more than just stage metrics, for instance, but it would be nice to differentiate them somehow, or use a common one.
@@ -549,6 +556,43 @@ class BenchmarkListener(list: ListBuffer[SparkPlanNode]) extends QueryExecutionL
  }
}

class BenchSparkListener(executionMetrics: ListBuffer[StageMetrics]) extends SparkListener {
Perhaps rename executionMetrics to include "stage" in the name, to make it clearer.
@@ -584,6 +629,15 @@ case class SparkSQLMetric(
  metricType: String,
  value: Any)

/** Summary of stage-level metrics */
case class StageMetrics(
nit - it's a little confusing because the name looks like a Spark name, like StageInfo. I wonder if we should add something to the name just to make it easier to differentiate. It's not that big of a deal, though.
val taskMetrics = stageInfo.taskMetrics

val stageMetrics = stageInfo.accumulables.map(acc => Try {
  val name = acc._2.name.getOrElse("")
I'm not sure about the usage of Try in this case. What benefit does it add, given that you already have the getOrElse? Are you expecting _2 to be null?
Also, not sure how often it happens, but if accumulators don't have names and multiple of them are "", we are going to lose that information. Perhaps we should use the id.
  name -> value
}).filter(_.isSuccess)
  .map(_.get)
  .filter(_._1.nonEmpty)
Also, here, why do we filter out the ones without names? The API allows you to create one without a name, so I would say we just pass it along with the id. I guess maybe we are assuming that only ones we create will always have names?
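The alternative suggested here, keying unnamed accumulators by their id instead of dropping them, can be sketched in plain Scala. The `AccInfo` stand-in below mirrors only the fields of Spark's `AccumulableInfo` used in the diff; the `accumulator_<id>` fallback key is an assumption for illustration, not the PR's actual scheme.

```scala
// Stand-in for the relevant fields of Spark's AccumulableInfo
// (org.apache.spark.scheduler): id, optional name, optional value.
case class AccInfo(id: Long, name: Option[String], value: Option[Any])

// Sketch of the suggested alternative: instead of Try plus filtering out
// empty names, keep every accumulator that has a value, falling back to
// an id-based key when the name is missing.
def collectAccumulators(accumulables: Map[Long, AccInfo]): Map[String, Any] =
  accumulables.values.flatMap { acc =>
    acc.value.map { v =>
      val key = acc.name.getOrElse(s"accumulator_${acc.id}")
      key -> v
    }
  }.toMap
```

This avoids both the Try wrapper and the information loss when several accumulators share the empty name.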
  "peakExecutionMemory" -> taskMetrics.peakExecutionMemory
)

executionMetrics += StageMetrics(
Do we also want to record whether the stage failed?
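Recording failure status, as suggested above, could use `StageInfo.failureReason`, which is `None` for a successful stage. The sketch below uses a stand-in case class so it is self-contained; the field shape matches Spark's `StageInfo`, but the helper name is hypothetical.

```scala
// Stand-in for the fields of Spark's StageInfo used here; the real class
// exposes failureReason: Option[String], defined only when the stage failed.
case class StageOutcome(stageId: Int, failureReason: Option[String])

// Sketch: derive a status string from the optional failure reason,
// so it could be stored alongside the stage-level metrics.
def stageStatus(info: StageOutcome): String =
  info.failureReason match {
    case Some(reason) => s"FAILED: $reason"
    case None         => "SUCCEEDED"
  }
```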
@tgravescs I'm putting this on hold until I've explored capturing the Spark event logs, since that may be a more appropriate solution.
This PR updates the benchmark utility so that it records stage-level metrics.
This closes #831
Example output from the unit test:
I have also tested with TPC queries, and the output looks good.