Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GpuShuffleCoalesce op time metric doesn't include concat batch time #5950

Merged
merged 2 commits into from
Jul 7, 2022

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Jul 5, 2022

Fixes #5891

The problem in the code is that the optime metric does not cover the concatenation time performed in the next() method.
Update the range of optime to include the next().

Signed-off-by: Chong Gao res_life@163.com

@res-life
Copy link
Collaborator Author

res-life commented Jul 5, 2022

build

@res-life
Copy link
Collaborator Author

res-life commented Jul 5, 2022

Now the op time covers the concat time:

GpuShuffleCoalesce

concat batch time total (min, med, max )
8 ms (2 ms, 2 ms, 2 ms )
......
op time total (min, med, max )
1 ms (0 ms, 0 ms, 1 ms )

==>>

GpuShuffleCoalesce

concat batch time total (min, med, max )
7 ms (1 ms, 1 ms, 1 ms )
op time total (min, med, max )
24 ms (4 ms, 6 ms, 7 ms )

@res-life
Copy link
Collaborator Author

res-life commented Jul 5, 2022

The op time now also covers the SEMAPHORE_WAIT_TIME time.

@revans2
Copy link
Collaborator

revans2 commented Jul 5, 2022

The op time now also covers the SEMAPHORE_WAIT_TIME time.

We do not want this. The semaphore wait time is supposed to be a separate thing. Ideally op time is only the time on the GPU.

From https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/docs/tuning-guide.md#metrics

opTime op time Time that an operator takes, exclusive of the time for executing or fetching results from child operators, and typically outside of the time it takes to acquire the GPU semaphore.

@revans2
Copy link
Collaborator

revans2 commented Jul 5, 2022

concat batch time was removed because we are concating the batch on the CPU, not the GPU. opTime was intended to convey the GPU time. I am fine if we redefine it to include CPU time, so long as it is documented clearly and there is a way to pull it back apart again, i.e. we know that the concat time happened on the CPU so the GPU time is opTime - concatTime

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Copy link
Collaborator Author

res-life commented Jul 6, 2022

build

1 similar comment
@res-life
Copy link
Collaborator Author

res-life commented Jul 6, 2022

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Jul 6, 2022
@res-life res-life merged commit 0f83616 into NVIDIA:branch-22.08 Jul 7, 2022
@res-life res-life deleted the fix-op-concat-time branch July 7, 2022 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GpuShuffleCoalesce op time metric doesn't include concat batch time
3 participants