Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Restructure AST internals to reduce stack depth and register pressure #9557

Open
vyasr opened this issue Oct 28, 2021 · 3 comments
Open
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@vyasr
Copy link
Contributor

vyasr commented Oct 28, 2021

Is your feature request related to a problem? Please describe.
The AST evaluation process currently used for conditional joins and the compute_column APIs is heavily dependent on multiple levels of device-side dispatch, both implicit and explicit. Explicit dispatches are performed based on argument types -- which happens once for unary ops and twice for binary ops -- and based on the ast_operator. Implicit dispatch is performed based on the types of data references (column, literal, or intermediate) in the deepest part of the code. Additionally, the entire evaluator is templated on the output type to support writing the results either to a column or to a stack variable (typically allocated within a kernel for the purpose of storing a thread-local output). The high complexity of these different features results in two major bottlenecks for performance:

  • As documented in [PERF] AST kernel has nonzero stack frame #5902, we are asking a lot of the compiler to inline all the function calls. However, if the compiler fails to do so, the resulting stack frames introduced in the kernel lead to variables spilling into local memory. The latency associated with increased local memory traffic has a substantial impact on performance. Lots of the recent work done to mitigate that has improved the situation, but Force inlining to improve AST performance #9530 shows that we still stand to make performance gains by increasing inlining (in that case, we observed nearly 2x improvements for benchmarks involving nullable columns).
  • The highly nested function calls means that even if the compiler successfully inlines functions, it may not be consistently minimizing register usage if it doesn't recognize which variables can be safely reused at different levels of the call stack. Improve performance of expression evaluation #9210 exhibited substantial performance improvements through suitable passing of const references, but revealed that pass-by-value semantics of const objects did not always result in the same performance improvements, suggesting that the compiler was not necessarily making the best use of the available information to minimize copying of data.

Various changes in #8145 helped reduce this complexity in a number of ways, but the current benchmarks still indicate that the various kernels are limited by register pressure or local memory traffic (depending on the complexity of the non-AST components of the kernel and the impact of null data).

Describe the solution you'd like
A major factor in the introduction of stack frames and the added complexity for the compiler in determining which variables it can safely leave in registers is the multi-level dispatch. We should consider replacing the compile-time dispatch-based solutions for the operator functors with an approach based on dynamic polymorphism of virtual operator functors. We could use something like the visitor pattern to handle type dispatch for binary operations, which would have the added benefit of naturally handling type casting using the language's own casting rules. While this would entail runtime vtable lookups for virtual functions, the tradeoff would be a dramatic simplification of the code the compiler generates since it would no longer need to instantiate large switch statements for every single templated code path and could instead expend its inline budget on more effectively inlining existing code.

Describe alternatives you've considered
Most of the alternatives to simplifying the code have already been completed (or attempted and discarded) in #8145. While those refactorings decreased complexity, going forward complexity is only likely to increase as we add more operators or support for additional cases, and I don't see any other way to simplify this code further.

Additional context
This change would be a substantial undertaking that would involve rewriting a significant chunk of the parsing and evaluation internals of the AST code. As such, we'll probably want to spend the better part of a release prototyping and testing.

@vyasr vyasr added feature request New feature or request Needs Triage Need team to review and classify labels Oct 28, 2021
@vyasr vyasr added this to the Conditional Joins milestone Oct 28, 2021
@vyasr vyasr self-assigned this Oct 28, 2021
@jrhemstad jrhemstad added Performance Performance related issue tech debt and removed Needs Triage Need team to review and classify labels Oct 28, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vyasr
Copy link
Contributor Author

vyasr commented Jan 10, 2022

While tackling this, it would be nice to also address #9917 (comment) and look into compile-time improvements like #9917 (comment).

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels Feb 10, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Nov 21, 2022
@vyasr vyasr removed the tech debt label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

No branches or pull requests

3 participants