[FEA] Restructure AST internals to reduce stack depth and register pressure #9557
Labels
0 - Backlog
In queue waiting for assignment
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
Milestone
Is your feature request related to a problem? Please describe.
The AST evaluation process currently used for conditional joins and the
compute_column
APIs is heavily dependent on multiple levels of device-side dispatch, both implicit and explicit. Explicit dispatches are performed based on argument types -- which happens once for unary ops and twice for binary ops -- and based on theast_operator
. Implicit dispatch is performed based on the types of data references (column, literal, or intermediate) in the deepest part of the code. Additionally, the entire evaluator is templated on the output type to support writing the results either to a column or to a stack variable (typically allocated within a kernel for the purpose of storing a thread-local output). The high complexity of these different features results in two major bottlenecks for performance:Various changes in #8145 helped reduce this complexity in a number of ways, but the current benchmarks still indicate that the various kernels are limited by register pressure or local memory traffic (depending on the complexity of the non-AST components of the kernel and the impact of null data).
Describe the solution you'd like
A major factor in the introduction of stack frames and the added complexity for the compiler in determining which variables it can safely leave in registers is the multi-level dispatch. We should consider replacing the compile-time dispatch-based solutions for the operator functors with an approach based on dynamic polymorphism of virtual operator functors. We could use something like the visitor pattern to handle type dispatch for binary operations, which would have the added benefit of naturally handling type casting using the language's own casting rules. While this would entail runtime vtable lookups for virtual functions, the tradeoff would be a dramatic simplification of the code the compiler generates since it would no longer need to instantiate large switch statements for every single templated code path and could instead expend its inline budget on more effectively inlining existing code.
Describe alternatives you've considered
Most of the alternatives to simplifying the code have already been completed (or attempted and discarded) in #8145. While those refactorings decreased complexity, going forward complexity is only likely to increase as we add more operators or support for additional cases, and I don't see any other way to simplify this code further.
Additional context
This change would be a substantial undertaking that would involve rewriting a significant chunk of the parsing and evaluation internals of the AST code. As such, we'll probably want to spend the better part of a release prototyping and testing.
The text was updated successfully, but these errors were encountered: