[FEA] Improve PTDS performance #533

Closed
8 of 9 tasks
rongou opened this issue Aug 8, 2020 · 5 comments

@rongou
Collaborator

rongou commented Aug 8, 2020

Is your feature request related to a problem? Please describe.
CUDA per-thread default stream (PTDS) is enabled for the plugin, but benchmark results haven't shown a big jump compared to the legacy default stream. We need to figure out why and try to improve performance.

Describe the solution you'd like
At the moment there is no clear solution, but here are some ideas:

Describe alternatives you've considered
It's possible some of the benchmark queries are so I/O bound that increasing GPU concurrency does not reduce the wall clock time.
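
To make the I/O-bound argument concrete, here is a rough Amdahl's-law sketch. The function name and the example fractions are hypothetical illustrations, not measurements from these benchmarks; it only assumes that PTDS speeds up the GPU-bound part of a query and leaves the I/O part unchanged.

```python
def max_wall_clock_speedup(gpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the GPU-bound fraction gets faster.

    gpu_fraction: share of wall clock time spent on GPU work (0.0 to 1.0).
    gpu_speedup:  how much faster that GPU work becomes (e.g. 2.0 for 2x).
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    if gpu_speedup <= 0.0:
        raise ValueError("gpu_speedup must be positive")
    # The I/O-bound remainder (1 - gpu_fraction) is not accelerated at all.
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)


# Hypothetical numbers: a query that is 80% I/O versus one that is 20% I/O,
# both with the GPU portion made twice as fast.
io_heavy = max_wall_clock_speedup(gpu_fraction=0.2, gpu_speedup=2.0)
gpu_heavy = max_wall_clock_speedup(gpu_fraction=0.8, gpu_speedup=2.0)
print(f"I/O-heavy query: {io_heavy:.2f}x, GPU-heavy query: {gpu_heavy:.2f}x")
```

Under these assumed fractions, even a doubling of GPU throughput yields only a small end-to-end gain for the I/O-heavy query, which would be consistent with the flat benchmark results.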

Additional context
Original issue to enable PTDS: #15

@rongou rongou added feature request New feature or request ? - Needs Triage Need team to review and classify labels Aug 8, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Aug 10, 2020
@harrism

harrism commented Aug 11, 2020

but benchmark results haven't shown a big jump compared to the legacy default stream.

There's an important option you really should consider: a single stream might not be the bottleneck?

I think your first checkbox is crucial (simple standalone benchmark profiling).

@JustPlay

@rongou

"but benchmark results haven't shown a big jump compared to the legacy default stream"

what benchmark? and how much?

thanks

@rongou
Collaborator Author

rongou commented Aug 11, 2020

We are using TPCx-BB. When I/O is very saturated, PTDS is only slightly faster, and the difference is probably not statistically significant. On a single GPU there are 10-20% improvements. Still investigating.
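
One way to check whether a small difference like this is statistically significant is an exact permutation test on repeated run times. The sketch below is only an illustration: the function name and the sample timings are hypothetical, and it is not the method used for the results above.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of mean run times.

    Enumerates every way to split the pooled timings into groups of the
    original sizes and returns the fraction of splits whose absolute
    difference in means is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a = len(a)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance so float rounding does not exclude exact ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical wall clock times in seconds for five runs of each setting.
legacy_runs = [100.0, 102.0, 98.0, 101.0, 99.0]
ptds_runs = [97.0, 100.0, 96.0, 99.0, 98.0]
print("p-value:", permutation_p_value(legacy_runs, ptds_runs))
```

A large p-value would mean the observed gap is within what run-to-run noise alone could produce. Exhaustive enumeration is only practical for a handful of runs per setting; with more runs, random sampling of splits would be needed.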

@rongou
Collaborator Author

rongou commented Aug 13, 2020

Tried with the shuffle manager enabled. There seems to be more memory pressure/fragmentation, so I had to increase the number of shuffle partitions and reduce GPU concurrency. It seems to perform better than without the shuffle manager, but it's not clear that PTDS gives bigger gains.
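
For readers who want to try the same kind of tuning, the sketch below renders the two knobs mentioned here as `spark-submit` arguments. It assumes the standard `spark.sql.shuffle.partitions` setting and the plugin's `spark.rapids.sql.concurrentGpuTasks` setting are the ones being adjusted; the values and the helper name are hypothetical, not the ones used in these runs.

```python
def to_spark_submit_args(conf):
    """Render a dict of Spark settings as a flat list of --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values for illustration only.
tuning = {
    # More, smaller shuffle partitions (Spark's default is 200).
    "spark.sql.shuffle.partitions": 400,
    # Fewer tasks sharing the GPU at once, to ease memory pressure.
    "spark.rapids.sql.concurrentGpuTasks": 1,
}

print(" ".join(to_spark_submit_args(tuning)))
```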

@rongou
Collaborator Author

rongou commented Nov 6, 2020

This should be considered done.

@rongou rongou closed this as completed Nov 6, 2020
@sameerz sameerz added performance A performance related task/issue and removed feature request New feature or request labels Dec 14, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023