Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force inlining to improve AST performance #9530

Merged
merged 6 commits into from
Nov 12, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Oct 26, 2021

This PR is a minimal set of changes that appear to be required to maximize inlining of code involved in AST-based expression evaluation. Specifically, applying the __forceinline__ qualifier to the type dispatcher results in a significant performance improvement when nullable data is passed through the AST evaluator. I observe roughly a 2x performance improvement with a negligible increase in compilation time. Note that the specific improvements appear to be heavily dependent on the architecture being tested on. Interestingly, removing the __forceinline__ on the evaluate methods results in performance regressing back to its original values despite them being defined inline, an unexpected result.

…f the expression_evaluator to improve null performance.
@vyasr vyasr requested a review from a team as a code owner October 26, 2021 23:12
@vyasr vyasr requested review from bdice and vuule October 26, 2021 23:12
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 26, 2021
@vyasr vyasr requested a review from jrhemstad October 26, 2021 23:13
@vyasr vyasr self-assigned this Oct 26, 2021
@vyasr vyasr added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue labels Oct 26, 2021
@vyasr vyasr added this to the Conditional Joins milestone Oct 26, 2021
@vyasr
Copy link
Contributor Author

vyasr commented Oct 26, 2021

Out of morbid curiosity, I tested a serial compilation of libcudf without ccache.
Before:

real    300m19.754s
user    290m33.139s
sys     9m37.246s

After:

real    301m38.734s
user    291m58.041s
sys     9m29.889s

And here are benchmarks (note the differences for nullable data):

Benchmarks

Before:

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   318 ms          318 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  1060 ms         1060 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 2566 ms         2566 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   339 ms          339 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  1115 ms         1115 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 2703 ms         2703 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time             738 ms          738 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            2471 ms         2471 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time           6001 ms         6001 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time             773 ms          773 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            2616 ms         2616 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time           6379 ms         6379 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    326 ms          326 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   1092 ms         1092 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  2649 ms         2649 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    350 ms          350 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   1153 ms         1153 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  2786 ms         2786 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time              757 ms          757 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             2544 ms         2544 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time            6190 ms         6190 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time              797 ms          797 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             2709 ms         2709 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time            6549 ms         6548 ms            1

After

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   317 ms          317 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  1065 ms         1065 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 2582 ms         2582 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   340 ms          340 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  1127 ms         1127 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 2721 ms         2721 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time             419 ms          419 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            1426 ms         1425 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time           3462 ms         3462 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time             448 ms          448 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            1518 ms         1518 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time           3703 ms         3703 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    326 ms          326 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   1094 ms         1093 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  2652 ms         2652 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    350 ms          350 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   1154 ms         1154 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  2799 ms         2799 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time              431 ms          431 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             1461 ms         1461 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time            3544 ms         3544 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time              457 ms          457 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             1557 ms         1557 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time            3795 ms         3795 ms            1

@codecov
Copy link

codecov bot commented Oct 27, 2021

Codecov Report

Merging #9530 (1733fd4) into branch-21.12 (ab4bfaa) will decrease coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9530      +/-   ##
================================================
- Coverage         10.79%   10.67%   -0.12%     
================================================
  Files               116      117       +1     
  Lines             18869    19884    +1015     
================================================
+ Hits               2036     2123      +87     
- Misses            16833    17761     +928     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 67 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2144034...1733fd4. Read the comment docs.

@vyasr vyasr requested a review from jrhemstad October 30, 2021 00:22
@vyasr
Copy link
Contributor Author

vyasr commented Nov 1, 2021

@jrhemstad @bdice do we want to keep this PR very limited in scope, or should I should modify all places in libcudf that use __forceinline__ __device__ [__host__] in the code base to use the new macro(s)? If I'm going to make the change globally, then I may as well define both macros (a __host__ __device__ version and a __device__ version) since both have uses. I think I'd still prefer the longer names, which are much clearer than any abbreviation despite the length, but there are enough places where people would use it that I could be convinced otherwise.

@bdice
Copy link
Contributor

bdice commented Nov 2, 2021

@jrhemstad @bdice do we want to keep this PR very limited in scope, or should I should modify all places in libcudf that use __forceinline__ __device__ [__host__] in the code base to use the new macro(s)? If I'm going to make the change globally, then I may as well define both macros (a __host__ __device__ version and a __device__ version) since both have uses. I think I'd still prefer the longer names, which are much clearer than any abbreviation despite the length, but there are enough places where people would use it that I could be convinced otherwise.

There are a decent handful of existing uses of __forceinline__. I would like to see this PR define both macros (host+device and device), regardless of whether we choose to apply them globally in this PR or in a follow-up PR. My vote would be to keep this PR scope limited and apply the macros globally in a second PR.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HDFI/DFI LGTM

@vyasr
Copy link
Contributor Author

vyasr commented Nov 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 84e5a03 into rapidsai:branch-21.12 Nov 12, 2021
rapids-bot bot pushed a commit that referenced this pull request Jan 10, 2022
This PR is a follow-up to #9530 to standardize the names of the macros used for the `__host__ __device__` attributes. Aliases for `__device__` and combinations with inlining have been removed, and the only remaining macro is `CUDF_HOST_DEVICE` which is `__host__ __device__` in device code and empty in host code. See #9530 (comment) for more discussion.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - Mark Harris (https://github.com/harrism)

URL: #9797
@vyasr vyasr deleted the perf/ast_forceinline_types branch January 14, 2022 18:02
@bdice bdice mentioned this pull request Jan 26, 2022
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants