
[WIP] Record stage-level metrics when running benchmarks #847

Closed

Conversation

@andygrove (Contributor) commented Sep 24, 2020

This PR updates the benchmark utility so that it records stage-level metrics.

This closes #831

Example output from the unit test:

  "stageMetrics" : [ {
    "stageId" : 0,
    "taskCount" : 12,
    "stageMetrics" : {
      "duration" : 41,
      "internal.metrics.executorDeserializeCpuTime" : 2240825501,
      "internal.metrics.jvmGCTime" : 1140,
      "internal.metrics.resultSerializationTime" : 2,
      "number of output rows" : 100,
      "internal.metrics.resultSize" : 18765,
      "internal.metrics.executorDeserializeTime" : 6004,
      "internal.metrics.input.recordsRead" : 100,
      "internal.metrics.executorRunTime" : 545,
      "internal.metrics.executorCpuTime" : 116331531
    },
    "taskMetrics" : {
      "resultSerializationTime" : 2,
      "peakExecutionMemory" : 0,
      "diskBytesSpilled" : 0,
      "executorDeserializeTime" : 6004,
      "executorDeserializeCpuTime" : 2240825501,
      "jvmGCTime" : 1140,
      "memoryBytesSpilled" : 0,
      "executorCpuTime" : 116331531,
      "executorRunTime" : 545,
      "resultSize" : 18765
    }
  } ],

I have also tested with TPC queries and the output looks good.
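
For context, a minimal sketch of how stage-level metrics like those above can be captured with a SparkListener. The StageMetrics case class here is a simplified stand-in inferred from the JSON output; the PR's actual class may differ:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Hypothetical, simplified mirror of the StageMetrics case class added by
    // this PR; field names are inferred from the JSON output above.
    case class StageMetrics(
        stageId: Int,
        taskCount: Int,
        stageMetrics: Map[String, Any],
        taskMetrics: Map[String, Long])

    // Sketch: record one StageMetrics entry per completed stage.
    class StageMetricsListener(out: ListBuffer[StageMetrics]) extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        val tm = info.taskMetrics
        out += StageMetrics(
          stageId = info.stageId,
          taskCount = info.numTasks,
          // stage-level accumulator values, keyed by accumulator name
          stageMetrics = info.accumulables.values.flatMap { acc =>
            for (n <- acc.name; v <- acc.value) yield n -> v
          }.toMap,
          // a few of the aggregated task metrics shown in the example output
          taskMetrics = Map(
            "executorRunTime" -> tm.executorRunTime,
            "executorCpuTime" -> tm.executorCpuTime,
            "jvmGCTime" -> tm.jvmGCTime,
            "resultSize" -> tm.resultSize))
      }
    }

Such a listener would be registered on the SparkContext, as in this PR's benchmark loop: spark.sparkContext.addSparkListener(new StageMetricsListener(buffer)).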

…lso fix off-by-one error with nextId when generating DOT graphs

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the test Only impacts tests label Sep 24, 2020
@andygrove andygrove added this to the Sep 14 - Sep 25 milestone Sep 24, 2020
@andygrove andygrove self-assigned this Sep 24, 2020
@andygrove andygrove changed the title [WIP] Record stage-level metrics when running benchmarks Record stage-level metrics when running benchmarks Sep 25, 2020
@andygrove (Contributor Author)

build

@andygrove (Contributor Author)

build

@tgravescs (Collaborator)

build

@andygrove (Contributor Author)

build

    if (i+1 == iterations) {
      spark.listenerManager.register(new BenchmarkListener(queryPlansWithMetrics))
      spark.sparkContext.addSparkListener(new BenchSparkListener(stageMetrics))
Collaborator

We have BenchmarkListener and BenchSparkListener, and I'm not sure what either does from the name alone. Perhaps we should give them more meaningful names. I get that they might be used for more than just stage metrics, for instance, but it would be nice to differentiate them somehow, or use a common one.
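
One way to read the "use a common one" suggestion is a single class implementing both listener interfaces; a hypothetical sketch, reusing the PR's SparkPlanNode and StageMetrics types:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    // Hypothetical combined listener: one object serves both registration
    // points (spark.listenerManager for query plans, sparkContext for stages).
    class BenchmarkMetricsListener(
        plans: ListBuffer[SparkPlanNode],   // SparkPlanNode is the PR's type
        stages: ListBuffer[StageMetrics])   // StageMetrics is the PR's type
      extends SparkListener with QueryExecutionListener {

      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        // record stage-level metrics, as BenchSparkListener does
      }

      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // record the executed plan with metrics, as BenchmarkListener does
      }

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }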

@@ -549,6 +556,43 @@ class BenchmarkListener(list: ListBuffer[SparkPlanNode]) extends QueryExecutionL
}
}

class BenchSparkListener(executionMetrics: ListBuffer[StageMetrics]) extends SparkListener {
Collaborator

Perhaps rename executionMetrics to include "stage" in the name, to be clearer.

@@ -584,6 +629,15 @@ case class SparkSQLMetric(
metricType: String,
value: Any)

/** Summary of stage-level metrics */
case class StageMetrics(
Collaborator

nit - the name is a little confusing because it looks like a Spark name, like StageInfo. I wonder if we should add something to the name just to make it easier to differentiate. It's not that big of a deal, though.

    val taskMetrics = stageInfo.taskMetrics

    val stageMetrics = stageInfo.accumulables.map(acc => Try {
      val name = acc._2.name.getOrElse("")
Collaborator

I'm not sure about the usage of Try here - what benefit does it add, given that you have the getOrElse? Are you expecting _2 to be null?
Also, not sure how often it happens, but if multiple accumulators have no name, they all map to "" and we lose that information; perhaps we should use the id.

      name -> value
    }).filter(_.isSuccess)
      .map(_.get)
      .filter(_._1.nonEmpty)
Collaborator

Also, why do we filter out the ones without names here? The API allows creating an accumulator without a name, so I would say we just pass it along with the id. Or maybe the thinking is that the ones we create will always have names? A possible rewrite is sketched below.
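
A possible rewrite in the spirit of this comment and the Try comment above (hypothetical, not code from this PR): key unnamed accumulators by id, which keeps them and makes both Try and the nonEmpty filter unnecessary:

    // Fall back to the accumulator id when the name is missing, so unnamed
    // accumulators are kept rather than collapsed into "" and filtered out.
    val stageMetrics = stageInfo.accumulables.flatMap { case (id, acc) =>
      acc.value.map(value => acc.name.getOrElse(s"accumulator.$id") -> value)
    }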

"peakExecutionMemory" -> taskMetrics.peakExecutionMemory
)

executionMetrics += StageMetrics(
Collaborator

Do we also want to record whether the stage failed?
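
If so, StageInfo exposes that directly; a small sketch (the failed field is hypothetical, not part of this PR):

    // StageInfo.failureReason is Some(message) when the stage failed.
    val stageFailed = stageInfo.failureReason.isDefined
    // e.g. StageMetrics(stageId, taskCount, failed = stageFailed, ...)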

@andygrove andygrove changed the title Record stage-level metrics when running benchmarks [WIP] Record stage-level metrics when running benchmarks Oct 1, 2020
@andygrove (Contributor Author)
@tgravescs I'm putting this on hold until I've explored capturing the Spark event logs, since that may be a more appropriate solution.

@andygrove andygrove closed this Oct 1, 2020
@andygrove andygrove deleted the record-stage-task-metrics branch December 17, 2020 15:27
Successfully merging this pull request may close these issues.

[FEA] Comparison of task deserialization metrics as part of benchmarking