Some metrics improvements and timeline reporting #4451

revans2 · 2022-01-03T14:46:57Z

This came out of trying to find a heuristic for optimizing Parquet/ORC splits, and of looking at buffering times to understand whether there are more optimizations we could do for HDFS and other distributed file systems.

It fixes some metrics and offers a way to visualize the metrics in the timeline view from the profiling tool. The colors on the timeline were already really bad, and this does not help at all.

I don't consider this completely done because I have not documented the metrics reporting yet. I held off because I wasn't sure whether the colors I picked are okay, or whether we want a pattern in addition to a color to make the segments easier to tell apart. I am also not sure this is something we want on by default, especially because the semaphore time is only reported when debug metrics are enabled. Here is a high-level overview.

The bottom half of each task shows the amount of time taken as reported by various metrics.

yellow is the deserialization time for the task as reported by Spark. This works on both CPU and GPU tasks.
white is the read time for a task. This is a combination of the "buffer time" SQL metric and the shuffle read time as reported by Spark. The shuffle data works on both CPU and GPU, but the buffer time metric is GPU only.
red is the semaphore wait time. This only shows up on GPU tasks when DEBUG metrics are enabled. It does not apply to CPU tasks.
green is the "op time" SQL metric. This is GPU task specific. I am also a little concerned about this because I have seen it be longer than the total time for a task. I fixed one issue where the op time included the semaphore time for shuffle coalesce, but I can still see it for tasks with lots of large joins in them.
blue is the write time for a task. This is a combination of the "write time" SQL metric and the shuffle write time as reported by Spark. As with the read time, the shuffle metrics work for both GPU and CPU, but the write time metric is GPU specific.
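
To make the legend above concrete, here is a minimal sketch of how per-task timings could be turned into colored segments. It is not the code in `GenerateTimeline.scala`: `TaskTimings`, `Segment`, and `segments` are hypothetical names, laying the segments end to end in legend order is an assumption, and clamping to the task duration is just one possible way to cope with op time exceeding the total task time.

```scala
// Illustrative sketch only; every name here is hypothetical and
// not part of the profiling tool.
object TimelineLegendSketch {
  // Per-task timings in milliseconds. GPU-only metrics default to 0,
  // which is what a CPU task would report.
  final case class TaskTimings(
      durationMs: Long,
      deserializationMs: Long,
      shuffleReadMs: Long,
      bufferTimeMs: Long = 0L,    // GPU only
      semaphoreWaitMs: Long = 0L, // GPU only, DEBUG metrics
      opTimeMs: Long = 0L,        // GPU only
      shuffleWriteMs: Long = 0L,
      writeTimeMs: Long = 0L)     // GPU only

  final case class Segment(color: String, startMs: Long, lengthMs: Long)

  def segments(t: TaskTimings): Seq[Segment] = {
    // Legend order; read and write each combine a SQL metric with
    // the matching shuffle time reported by Spark.
    val ordered = Seq(
      "yellow" -> t.deserializationMs,
      "white" -> (t.bufferTimeMs + t.shuffleReadMs),
      "red" -> t.semaphoreWaitMs,
      "green" -> t.opTimeMs,
      "blue" -> (t.writeTimeMs + t.shuffleWriteMs))

    val (_, result) = ordered.foldLeft((0L, Vector.empty[Segment])) {
      case ((offset, acc), (color, ms)) =>
        // Assumption: clamp so the bar never runs past the task duration.
        val len = math.max(0L, math.min(ms, t.durationMs - offset))
        if (len > 0) (offset + len, acc :+ Segment(color, offset, len))
        else (offset, acc)
    }
    result
  }
}
```

With this sketch, a CPU task only ever produces yellow, white, and blue segments, and a GPU task whose op time is larger than its duration gets a green segment truncated at the end of the task.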

Feedback is appreciated.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

gerashegalov

LGTM, minor comments

gerashegalov · 2022-01-03T20:44:50Z

tools/src/main/scala/com/nvidia/spark/rapids/tool/profiling/GenerateTimeline.scala

+      yStart: Long,
+      minStart: Long,
+      fileWriter: ToolTextFileWriter): Unit = {
+    val x = xStart + (startTime - minStart)/MS_PER_PIXEL


nit: spaces around /

gerashegalov · 2022-01-03T20:49:13Z

docs/tuning-guide.md

+| Key              | Name                     | Description                                                                                                                                                                         |
+|------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| bufferTime       | buffer time              | Time spent buffering input from file data sources. This buffering time happens on the CPU, typically with no GPU semaphore held.                                                    |
+| readFsTime       | time to read fs data     | Time spent actually reading the data and writing it to on heap memory. This is a part of `bufferTime`                                                                               |


The hyphenated spellings on-heap and off-heap are easier to parse.

revans2 · 2022-01-04T16:20:48Z

build

revans2 · 2022-01-04T16:40:32Z

Converting to draft because I found some issues with op time for join that I want to understand better.

revans2 · 2022-01-04T18:53:50Z

build

docs/spark-profiling-tool.md

sql-plugin/src/main/scala/com/nvidia/spark/rapids/AbstractGpuJoinIterator.scala

tools/src/main/scala/com/nvidia/spark/rapids/tool/profiling/GenerateTimeline.scala

sql-plugin/src/main/scala/com/nvidia/spark/rapids/ColumnarOutputWriter.scala

jlowe · 2022-01-04T20:26:59Z

build

revans2 added 4 commits December 30, 2021 15:05

Added in more DEBUG buffering metrics

44a0c58

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

Update timeline to include some metrics

4d8d215

Addes shuffle metrics to the timings

ffb10c8

Fix a metric and clean up timeline a little

3869bde

revans2 added the task (work required that improves the product but is not user facing) and tools labels Jan 3, 2022

revans2 added this to the Dec 13 - Jan 7 milestone Jan 3, 2022

revans2 self-assigned this Jan 3, 2022

Update comment

ad5ff46

gerashegalov previously approved these changes Jan 3, 2022

revans2 added 2 commits January 4, 2022 09:53

Merge branch 'branch-22.02' into read_metrics

a8d3094

Addressed review comments and added docs

68605f2

revans2 dismissed gerashegalov’s stale review via 68605f2 January 4, 2022 16:20

revans2 marked this pull request as draft January 4, 2022 16:39

Fixed metrics for join and aggregate

18c645c

revans2 marked this pull request as ready for review January 4, 2022 18:53

jlowe reviewed Jan 4, 2022

revans2 added 2 commits January 4, 2022 14:19

Fixed spelling

9883e37

Update copyright

5443d83

jlowe approved these changes Jan 4, 2022

revans2 merged commit b3d37ae into NVIDIA:branch-22.02 Jan 5, 2022

revans2 deleted the read_metrics branch January 5, 2022 12:59
