Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] AST kernel has nonzero stack frame #5902

Closed
bdice opened this issue Aug 10, 2020 · 7 comments
Closed

[PERF] AST kernel has nonzero stack frame #5902

bdice opened this issue Aug 10, 2020 · 7 comments
Labels
0 - Backlog In queue waiting for assignment improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@bdice
Copy link
Contributor

bdice commented Aug 10, 2020

During development of #5494, I have been expanding the supported AST interpreter's feature set. With a smaller set of supported operators and data types, the kernel ran quickly and had high memory throughput. As the set of features expanded, a nonzero stack frame was introduced and performance dropped significantly. With a nonzero stack frame, each thread must access "local memory" (actually in the global memory bank) during execution, which reduces performance (about 4-10x slower, from ~450 GB/s to ~50-100 GB/s). This issue documents my results from investigating this stack frame and attempts to uncover what caused it.

I have attached a spreadsheet of raw data from building and testing many commits, summarized below.
Stack Frame Tracing.xlsx

  1. The commit "Add typed_operator_dispatch_functor" is the commit introducing double-dispatch to the AST kernel. This commit introduces a stack frame. Prior to this commit, the previous approach to AST device-side code was much more limited in functionality, which is why we redesigned the internals to use double-dispatch.

  2. During early development of the double-dispatch code, I had zero stack frame with double-dispatch, but only until I merged in new code from upstream in branch-0.15. This suggests that the addition of new types (possibly fixed point) caused the kernel size to exceed some unknown limit, forcing it to have a nonzero stack frame. To reproduce this, I rebased the AST branch onto an older version of branch-0.15 from June 17th (the date the AST branch was created) and did not merge in any upstream changes from branch-0.15 after that date (except for the required double-dispatch PR [REVIEW] Add a double type dispatcher. #5716). It looks like this commit triggered the introduction of a stack frame on branch ast-rebase3: bdice@12cdb22

One way to get these statistics is:

cuobjdump --dump-resource-usage cpp/build/release/CMakeFiles/cudf.dir/src/ast/ast.cu.o | c++filt

or to build with nvcc flags -Xptxas="-v", but that's much slower (ccache won't work if the compilation produces output).

@bdice bdice added the Performance Performance related issue label Aug 10, 2020
@bdice bdice self-assigned this Aug 10, 2020
@github-actions
Copy link

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@bdice
Copy link
Contributor Author

bdice commented Feb 19, 2021

@harrism I think this issue can be closed. It provides some helpful information about the development of the AST feature but no further action is needed. Please re-open if needed. 👍

@bdice bdice closed this as completed Feb 19, 2021
@jrhemstad
Copy link
Contributor

Given we're picking back up on the AST work, I'd actually like to keep this open so we don't forget about it.

@jrhemstad jrhemstad reopened this Feb 19, 2021
@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-90d labels Feb 19, 2021
@kkraus14 kkraus14 added the improvement Improvement / enhancement to an existing function label Mar 29, 2021
@beckernick beckernick added this to the Conditional Joins milestone Jul 23, 2021
@vyasr
Copy link
Contributor

vyasr commented Aug 10, 2021

@jrhemstad I don't think there's anything actionable here any longer, is there? We haven't seen any performance regressions in the various merged iterations of the AST code since, so I don't think we have incurred any unforeseen additional local memory accesses. There are definitely things we can try to do reduce register pressure from the complex internals (mostly relating to trying to redesign the data reference structures), but I don't think those are directly related to this issue any longer and we know to keep an eye out for this problem as a potential cause if we observe future performance regressions.

@jrhemstad
Copy link
Contributor

We haven't seen any performance regressions in the various merged iterations of the AST code since

If you compare to the performance @bdice originally saw before there was any stack frame, I think there is still a pretty sizable regression. I think there is still room for analysis and optimization, but it's not pressing.

@vyasr
Copy link
Contributor

vyasr commented Aug 11, 2021

Interesting. As of #8214 I found that the AST-based equality join was just under 2x slower than the raw nested loop join using a simple row equality comparator. I wouldn't have anticipated being able to reduce the overhead any further, but that's encouraging if you think there's further remove to improve that.

@vyasr
Copy link
Contributor

vyasr commented May 16, 2024

I'm going to close this. Over time we may be able to leverage some new compiler options or similar to improve this situation marginally, but in the long term I think the only way we'll really be able to overcome the performance issues with evaluation are to switch away from the AST approach and go with something like #15366

@vyasr vyasr closed this as completed May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants