
[BUG] The metric "total time" of GpuColumnarExchange is strange #952

Closed
GaryShen2008 opened this issue Oct 15, 2020 · 3 comments · Fixed by #973
@GaryShen2008 (Collaborator)
Describe the bug
When using the legacy shuffle, the SQL plan shows the metrics below.
[screenshot: SQL plan showing GpuColumnarExchange metrics]
The "total time total" value is less than "shuffle write time" + "fetch wait time", which may confuse users.

Steps/Code to reproduce bug
Run any ETL app with the legacy shuffle.

Expected behavior
Make the metrics name correct or find a way to show the correct value of total time.

Environment details (please complete the following information)

  • Environment location: Standalone (any cluster type can likely reproduce this).
  • Spark configuration settings related to the issue: none.

Additional context
None

@GaryShen2008 GaryShen2008 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 15, 2020
@JustPlay
Some questions about metrics in the scan node:

GpuScan parquet data_3t.store_sales

number of files read: 1,824
scan time total (min, med, max (stageId: taskId))
1.03 h (676 ms, 3.1 s, 11.1 s (stage 3.0: task 4092))
number of output columnar batches: 1,852
peak device memory total (min, med, max (stageId: taskId))
665.0 GiB (270.4 MiB, 616.6 MiB, 1599.0 MiB (stage 3.0: task 3833))
GPU decode time total (min, med, max (stageId: taskId))
6.1 m (76 ms, 297 ms, 2.9 s (stage 3.0: task 3967))
metadata time: 2.9 s
size of files read: 1536.6 GiB
number of output rows: 28,799,975,831
total time total (min, med, max (stageId: taskId))
54.4 m (530 ms, 2.7 s, 10.6 s (stage 3.0: task 4177))
number of partitions read: 1,824
buffer time total (min, med, max (stageId: taskId))
49.8 m (389 ms, 2.1 s, 22.3 s (stage 3.0: task 4662))

For the scan node:

  1. What do scan time, GPU decode time, total time, and buffer time mean?
  2. Why is total_time < scan_time?

Thanks

@andygrove andygrove self-assigned this Oct 16, 2020
@andygrove (Contributor)
@GaryShen2008 I looked into this and I agree that the value we are currently reporting for total time for shuffles is confusing (it is actually measuring the time to create an internal iterator, which is misleading). Given that Spark doesn't report this metric for shuffles, I propose that we remove this metric. I will create a PR.

@jlowe (Member) commented Oct 16, 2020

@JustPlay apologies for the delay, here are some answers to your questions.

what scan time, GPU decode time, total time, buffer time mean?

  • scan time is a standard Spark metric measuring the time it takes the underlying FileScanRDD to produce the batches, i.e.: the time needed to scan the input source. See https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L484-L492 for code details.
  • GPU decode time measures the time spent waiting for the libcudf call to complete. This measures the amount of time it took the GPU to decode the data once it was loaded into host memory.
  • total time is a metric added to many GPU exec nodes measuring the time spent processing in that node. In the case of scan nodes, it's fairly redundant with scan time metric.
  • buffer time measures the amount of time spent loading the data from the distributed filesystem into host memory. Note that when multithreaded reads are enabled via spark.rapids.sql.format.parquet.multiThreadedRead.enabled then you can see more time spent in buffer time than is reflected in the wall-clock time measured by scan time and total time. This is similar to how user time can exceed real time when using the time command to measure commands.
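To make the iterator-driven metrics above concrete, here is a minimal, hypothetical sketch (not the actual Spark or plugin classes) of how a metric like scan time accumulates: a wrapper times each `hasNext()` call on the underlying iterator and adds the elapsed nanoseconds to a counter.

```java
import java.util.Iterator;
import java.util.List;

public class TimedIteratorSketch {
    // Stand-in for Spark's SQLMetric accumulator; name is illustrative.
    static class MetricStub {
        long valueNs = 0;
        void add(long ns) { valueNs += ns; }
    }

    // Wrap an iterator so the wall-clock time spent inside hasNext()
    // is accumulated into the metric, as the scan-time metric does
    // around the underlying FileScanRDD iterator.
    static <T> Iterator<T> timed(Iterator<T> underlying, MetricStub metric) {
        return new Iterator<T>() {
            @Override public boolean hasNext() {
                long start = System.nanoTime();
                boolean result = underlying.hasNext();
                metric.add(System.nanoTime() - start);
                return result;
            }
            @Override public T next() { return underlying.next(); }
        };
    }

    // Drain the wrapped iterator, charging hasNext() time to the metric.
    static int consumeAndSum(MetricStub metric) {
        Iterator<Integer> it = timed(List.of(1, 2, 3).iterator(), metric);
        int sum = 0;
        while (it.hasNext()) sum += it.next();
        return sum;
    }

    public static void main(String[] args) {
        MetricStub scanTime = new MetricStub();
        System.out.println(consumeAndSum(scanTime)); // 6
        System.out.println(scanTime.valueNs >= 0);   // time was accumulated
    }
}
```

Note this also illustrates the buffer-time caveat: if several threads each accumulate into one such counter concurrently, the metric total can exceed wall-clock elapsed time, like user time exceeding real time under `time`.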

why total_time < scan_time?

The latter is a standard Spark metric for scans that is measuring time at a very high level for the node, i.e.: the hasNext method of doExecuteColumnar. The former is a metric added by the plugin that is measuring a strict subset of what scan time measures. In practice they normally should be very close to each other and thus having both metrics isn't adding a lot of value. I noticed that total time is not measuring the time it takes to acquire the GPU semaphore when a task has an empty batch (e.g.: there are no row groups to read in Parquet when processing an input split, so it hasn't grabbed the semaphore as part of normal processing). That may be the reason your scan time and total time aren't closer together.
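The subset relationship above can be sketched in a few lines (all names hypothetical, timings simulated): an outer timer covers the whole batch-producing path, while an inner timer covers only part of it (e.g. it misses a semaphore wait), so the inner metric necessarily accumulates less.

```java
public class NestedMetricSketch {
    static long scanTimeNs = 0;   // outer metric: whole hasNext() path
    static long totalTimeNs = 0;  // inner metric: subset of the work

    // Busy-wait stand-in for real work (I/O, semaphore wait, decode).
    static void simulateWork(long ns) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < ns) { /* spin */ }
    }

    static void produceBatch() {
        long outerStart = System.nanoTime();
        simulateWork(1_000_000);  // e.g. semaphore wait: timed by outer only
        long innerStart = System.nanoTime();
        simulateWork(2_000_000);  // processing: timed by both metrics
        totalTimeNs += System.nanoTime() - innerStart;
        scanTimeNs  += System.nanoTime() - outerStart;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) produceBatch();
        // Inner metric is a strict subset of the outer one.
        System.out.println(totalTimeNs < scanTimeNs); // true
    }
}
```

Any work the inner timer skips (here the 1 ms "semaphore wait") shows up only in the outer metric, which is the gap described above.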

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 20, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#952)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>