
[WIP] Record stage-level metrics when running benchmarks #847

Closed

Conversation

@andygrove (Contributor) commented Sep 24, 2020

This PR updates the benchmark utility so that it records stage-level metrics.

This closes #831

Example output from the unit test:

  "stageMetrics" : [ {
    "stageId" : 0,
    "taskCount" : 12,
    "stageMetrics" : {
      "duration" : 41,
      "internal.metrics.executorDeserializeCpuTime" : 2240825501,
      "internal.metrics.jvmGCTime" : 1140,
      "internal.metrics.resultSerializationTime" : 2,
      "number of output rows" : 100,
      "internal.metrics.resultSize" : 18765,
      "internal.metrics.executorDeserializeTime" : 6004,
      "internal.metrics.input.recordsRead" : 100,
      "internal.metrics.executorRunTime" : 545,
      "internal.metrics.executorCpuTime" : 116331531
    },
    "taskMetrics" : {
      "resultSerializationTime" : 2,
      "peakExecutionMemory" : 0,
      "diskBytesSpilled" : 0,
      "executorDeserializeTime" : 6004,
      "executorDeserializeCpuTime" : 2240825501,
      "jvmGCTime" : 1140,
      "memoryBytesSpilled" : 0,
      "executorCpuTime" : 116331531,
      "executorRunTime" : 545,
      "resultSize" : 18765
    }
  } ],

I have also tested with TPC queries and the output looks good.
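
For context, a minimal sketch of how stage-level metrics like those above can be captured with a SparkListener. The StageMetrics case class here is a simplified stand-in inferred from the JSON output; the PR's actual class may differ:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Hypothetical, simplified mirror of the StageMetrics case class added by
    // this PR; field names are inferred from the JSON output above.
    case class StageMetrics(
        stageId: Int,
        taskCount: Int,
        stageMetrics: Map[String, Any],
        taskMetrics: Map[String, Long])

    // Sketch: record one StageMetrics entry per completed stage.
    class StageMetricsListener(out: ListBuffer[StageMetrics]) extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        val tm = info.taskMetrics
        out += StageMetrics(
          stageId = info.stageId,
          taskCount = info.numTasks,
          // stage-level accumulator values, keyed by accumulator name
          stageMetrics = info.accumulables.values.flatMap { acc =>
            for (n <- acc.name; v <- acc.value) yield n -> v
          }.toMap,
          // a few of the aggregated task metrics shown in the example output
          taskMetrics = Map(
            "executorRunTime" -> tm.executorRunTime,
            "executorCpuTime" -> tm.executorCpuTime,
            "jvmGCTime" -> tm.jvmGCTime,
            "resultSize" -> tm.resultSize))
      }
    }

Such a listener would be registered on the SparkContext, as in this PR's benchmark loop: spark.sparkContext.addSparkListener(new StageMetricsListener(buffer)).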

…lso fix off-by-one error with nextId when generating DOT graphs

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the test Only impacts tests label Sep 24, 2020
@andygrove andygrove added this to the Sep 14 - Sep 25 milestone Sep 24, 2020
@andygrove andygrove self-assigned this Sep 24, 2020
@andygrove andygrove changed the title [WIP] Record stage-level metrics when running benchmarks Record stage-level metrics when running benchmarks Sep 25, 2020
@andygrove (Contributor Author)

build

@andygrove (Contributor Author)

build

@tgravescs (Collaborator)

build

@andygrove (Contributor Author)

build

    if (i+1 == iterations) {
      spark.listenerManager.register(new BenchmarkListener(queryPlansWithMetrics))
      spark.sparkContext.addSparkListener(new BenchSparkListener(stageMetrics))
Collaborator

We have BenchmarkListener and BenchSparkListener, and I'm not sure what either does from the name alone. Perhaps we should give them more meaningful names. I get that they might be used for more than just stage metrics, for instance, but it would be nice to differentiate them somehow, or use a common one.
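
One way to read the "use a common one" suggestion is a single class implementing both listener interfaces; a hypothetical sketch, reusing the PR's SparkPlanNode and StageMetrics types:

    import scala.collection.mutable.ListBuffer
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    // Hypothetical combined listener: one object serves both registration
    // points (spark.listenerManager for query plans, sparkContext for stages).
    class BenchmarkMetricsListener(
        plans: ListBuffer[SparkPlanNode],   // SparkPlanNode is the PR's type
        stages: ListBuffer[StageMetrics])   // StageMetrics is the PR's type
      extends SparkListener with QueryExecutionListener {

      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        // record stage-level metrics, as BenchSparkListener does
      }

      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // record the executed plan with metrics, as BenchmarkListener does
      }

      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    }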

@@ -549,6 +556,43 @@ class BenchmarkListener(list: ListBuffer[SparkPlanNode]) extends QueryExecutionL
}
}

class BenchSparkListener(executionMetrics: ListBuffer[StageMetrics]) extends SparkListener {
Collaborator

Perhaps rename executionMetrics to include "stage" in the name, to be clearer.

@@ -584,6 +629,15 @@ case class SparkSQLMetric(
metricType: String,
value: Any)

/** Summary of stage-level metrics */
case class StageMetrics(
Collaborator

nit - the name is a little confusing because it looks like a Spark name, like StageInfo. I wonder if we should add something to the name just to make it easier to differentiate. It's not that big of a deal, though.

    val taskMetrics = stageInfo.taskMetrics

    val stageMetrics = stageInfo.accumulables.map(acc => Try {
      val name = acc._2.name.getOrElse("")
Collaborator

I'm not sure about the usage of Try here - what benefit does it add, given that you have the getOrElse? Are you expecting _2 to be null?
Also, not sure how often it happens, but if multiple accumulators have no name, they all map to "" and we lose that information; perhaps we should use the id.

      name -> value
    }).filter(_.isSuccess)
      .map(_.get)
      .filter(_._1.nonEmpty)
Collaborator

Also, why do we filter out the ones without names here? The API allows creating an accumulator without a name, so I would say we just pass it along with the id. Or maybe the thinking is that the ones we create will always have names? A possible rewrite is sketched below.
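
A possible rewrite in the spirit of this comment and the Try comment above (hypothetical, not code from this PR): key unnamed accumulators by id, which keeps them and makes both Try and the nonEmpty filter unnecessary:

    // Fall back to the accumulator id when the name is missing, so unnamed
    // accumulators are kept rather than collapsed into "" and filtered out.
    val stageMetrics = stageInfo.accumulables.flatMap { case (id, acc) =>
      acc.value.map(value => acc.name.getOrElse(s"accumulator.$id") -> value)
    }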

"peakExecutionMemory" -> taskMetrics.peakExecutionMemory
)

executionMetrics += StageMetrics(
Collaborator

Do we also want to record whether the stage failed?
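
If so, StageInfo exposes that directly; a small sketch (the failed field is hypothetical, not part of this PR):

    // StageInfo.failureReason is Some(message) when the stage failed.
    val stageFailed = stageInfo.failureReason.isDefined
    // e.g. StageMetrics(stageId, taskCount, failed = stageFailed, ...)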

@andygrove andygrove changed the title Record stage-level metrics when running benchmarks [WIP] Record stage-level metrics when running benchmarks Oct 1, 2020
@andygrove (Contributor Author)
@tgravescs I'm putting this on hold until I've explored capturing the Spark event logs, since that may be a more appropriate solution.

@andygrove andygrove closed this Oct 1, 2020
@andygrove andygrove deleted the record-stage-task-metrics branch December 17, 2020 15:27
Successfully merging this pull request may close these issues.

[FEA] Comparison of task deserialization metrics as part of benchmarking