Some metrics improvements and timeline reporting #4451

Merged (10 commits) on Jan 5, 2022

revans2 (Collaborator) commented Jan 3, 2022

This is the result of trying to find a heuristic to optimize Parquet/ORC splits, and of looking at buffering times to understand whether there are more optimizations we could do for HDFS and other distributed file systems.

It fixes some metrics and offers a way to visualize them in the timeline view of the profiling tool. The colors on the timeline were already really bad, and this does not help at all.

Screenshot from 2022-01-03 08-09-22

I don't consider this completely done because I have not documented the metrics reporting yet. I held off because I wasn't sure whether the colors I picked are okay, or whether we want a pattern in addition to a color to make the segments easier to tell apart. I am also not sure this is something we want on by default, especially because the semaphore time is only reported when debug metrics are enabled. Here is a high-level overview.

The bottom half of each task shows the amount of time taken as reported by various metrics.

  • yellow is the deserialization time for the task as reported by Spark. This works for both CPU and GPU tasks.
  • white is the read time for a task. This combines the "buffer time" SQL metric with the shuffle read time as reported by Spark. The shuffle data is available for both CPU and GPU tasks, but the buffer time metric is GPU only.
  • red is the semaphore wait time. This only shows up for GPU tasks when DEBUG metrics are enabled. It does not apply to CPU tasks.
  • green is the "op time" SQL metric. This is GPU specific. I am a little concerned about it because I have seen it be longer than the total time for a task. I fixed one issue where the op time included the semaphore time for shuffle coalesce, but I still see the problem for tasks with lots of large joins in them.
  • blue is the write time for a task. This combines the "write time" SQL metric with the shuffle write time as reported by Spark. As with the read time, the shuffle metrics work for both GPU and CPU tasks, but the write time metric is GPU specific.
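
The color scheme above could be sketched roughly as follows. This is a minimal illustration, not the profiling tool's actual code: every name here (`TimelineSketch`, `TaskMetrics`, `Segment`, `layout`, `MsPerPixel`) is hypothetical, the left-to-right ordering of the segments is an assumption, and scaling the segments down when the metrics add up to more than the task duration is just one possible way to handle the op time concern mentioned above.

```scala
// Hypothetical sketch; none of these names come from the profiling tool.
object TimelineSketch {
  // Assumed scale: how many milliseconds one horizontal pixel represents.
  val MsPerPixel: Long = 5L

  final case class TaskMetrics(
      durationMs: Long,        // total task time
      deserializationMs: Long, // reported by Spark, CPU and GPU tasks
      readMs: Long,            // "buffer time" + shuffle read time
      semaphoreMs: Long,       // GPU only, DEBUG metrics only
      opMs: Long,              // "op time", GPU only
      writeMs: Long)           // "write time" + shuffle write time

  final case class Segment(color: String, xOffsetPx: Long, widthPx: Long)

  def layout(m: TaskMetrics): Seq[Segment] = {
    val parts: Seq[(String, Long)] = Seq(
      "yellow" -> m.deserializationMs,
      "white" -> m.readMs,
      "red" -> m.semaphoreMs,
      "green" -> m.opMs,
      "blue" -> m.writeMs)
    val total = parts.map(_._2).sum
    // Assumption: if the metrics exceed the task duration, shrink them
    // proportionally so the bar never extends past the task itself.
    val scale =
      if (total > m.durationMs && total > 0) m.durationMs.toDouble / total else 1.0
    // Skip metrics that were not reported (zero) and apply the scale.
    val scaled = parts.collect { case (color, ms) if ms > 0 => (color, ms * scale) }
    // Running start offset, in milliseconds, of each remaining segment.
    val offsets = scaled.scanLeft(0.0) { case (acc, (_, ms)) => acc + ms }
    scaled.zip(offsets).map { case ((color, ms), offset) =>
      Segment(color, (offset / MsPerPixel).toLong, math.max(1L, (ms / MsPerPixel).toLong))
    }
  }
}
```

For example, `TimelineSketch.layout(TimelineSketch.TaskMetrics(1000, 100, 200, 0, 500, 200))` yields four segments and no red one, because the semaphore time is zero when DEBUG metrics are off.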

Feedback is appreciated.

revans2 added the task (Work required that improves the product but is not user facing) and tools labels Jan 3, 2022
revans2 added this to the Dec 13 - Jan 7 milestone Jan 3, 2022
revans2 self-assigned this Jan 3, 2022
gerashegalov (Collaborator) previously approved these changes Jan 3, 2022 and left a comment

LGTM, minor comments

yStart: Long,
minStart: Long,
fileWriter: ToolTextFileWriter): Unit = {
val x = xStart + (startTime - minStart)/MS_PER_PIXEL

nit: put spaces around `/`
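
For context, the line under review maps a task's start timestamp to a horizontal pixel position. A minimal standalone sketch of that mapping, with the spacing the nit asks for, assuming `MS_PER_PIXEL` is an integral number of milliseconds per pixel (its real type and value are not shown in this excerpt, and the `PixelScale` object and `xFor` helper are hypothetical):

```scala
// Hypothetical standalone version of the expression under review.
object PixelScale {
  // Assumed value; the tool's actual constant is not visible in this excerpt.
  val MS_PER_PIXEL: Long = 5L

  // Offset of the task from the earliest start time, converted to pixels
  // and shifted by the left edge of the drawing area.
  def xFor(xStart: Long, startTime: Long, minStart: Long): Long =
    xStart + (startTime - minStart) / MS_PER_PIXEL
}
```

With these assumed values, a task that starts 500 ms after `minStart` is drawn 100 pixels to the right of `xStart`.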

| Key | Name | Description |
|------------------|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| bufferTime | buffer time | Time spent buffering input from file data sources. This buffering time happens on the CPU, typically with no GPU semaphore held. |
| readFsTime | time to read fs data | Time spent actually reading the data and writing it to on heap memory. This is a part of `bufferTime` |

The hyphenated spellings "on-heap" and "off-heap" are easier to parse.
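
The table rows above describe `readFsTime` as a part of `bufferTime`. One way that nesting could arise is sketched below, under stated assumptions: `NanoMetric`, `timed`, and `bufferInput` are hypothetical helpers, not the plugin's metric API, and the `clone()` call merely stands in for whatever other CPU-side work counts as buffering.

```scala
// Hypothetical sketch of nested timing; not the plugin's actual metric classes.
object NestedTimers {
  // A trivial accumulator of elapsed nanoseconds.
  final class NanoMetric {
    private var total = 0L
    def add(ns: Long): Unit = total += ns
    def value: Long = total
  }

  // Run `body` and add its elapsed wall-clock time to `metric`.
  def timed[T](metric: NanoMetric)(body: => T): T = {
    val start = System.nanoTime()
    try body finally metric.add(System.nanoTime() - start)
  }

  // The file system read is timed inside the buffering timer, so the
  // read time is always a subset of the buffer time.
  def bufferInput(
      readFs: () => Array[Byte],
      bufferTime: NanoMetric,
      readFsTime: NanoMetric): Array[Byte] =
    timed(bufferTime) {
      val raw = timed(readFsTime) { readFs() }
      raw.clone() // stand-in for other CPU-side buffering work
    }
}
```

Because the inner timer runs entirely within the outer one, `readFsTime.value` can never exceed `bufferTime.value` in this sketch.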

revans2 (Collaborator, Author) commented Jan 4, 2022

build

revans2 marked this pull request as draft January 4, 2022 16:39

revans2 (Collaborator, Author) commented Jan 4, 2022

Converting to draft because I found some issues with op time for join that I want to understand better.

revans2 marked this pull request as ready for review January 4, 2022 18:53

revans2 (Collaborator, Author) commented Jan 4, 2022

build

jlowe (Member) commented Jan 4, 2022

build

revans2 merged commit b3d37ae into NVIDIA:branch-22.02 Jan 5, 2022
revans2 deleted the read_metrics branch January 5, 2022 12:59