optimization #2551

t4c1 · 2021-07-27T12:34:51Z

Summary

Optimize kernel generator so it can use matrix_cl's move assignment where possible instead of copying the data.

Tests

Added a test to check that the new optimization is used.

Side Effects

None.

Release notes

OpenCL: Optimized kernel generator so it can use matrix_cl's move assignment where possible instead of copying the data.

Checklist

Math issue #(issue number)
Copyright holder: Tadej Ciglarič

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

…4.1 (tags/RELEASE_600/final)

stan-buildbot · 2021-07-27T22:34:39Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	3.14	3.03	1.04	3.45% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.97	-3.07% slower
eight_schools/eight_schools.stan	0.11	0.11	1.05	5.04% faster
gp_regr/gp_regr.stan	0.16	0.16	1.01	1.14% faster
irt_2pl/irt_2pl.stan	5.9	5.82	1.01	1.41% faster
performance.compilation	87.82	87.27	1.01	0.62% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.67	8.55	1.01	1.29% faster
pkpd/one_comp_mm_elim_abs.stan	29.96	29.69	1.01	0.91% faster
sir/sir.stan	128.18	130.53	0.98	-1.83% slower
gp_regr/gen_gp_data.stan	0.03	0.03	1.0	0.42% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.01	2.99	1.01	0.97% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.4	0.39	1.02	2.2% faster
arK/arK.stan	1.88	1.88	1.0	0.09% faster
arma/arma.stan	0.93	0.82	1.12	10.97% faster
garch/garch.stan	0.63	0.53	1.19	15.83% faster
Mean result: 1.02960269611

Jenkins Console Log
Blue Ocean
Commit hash: 6738d7b

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

SteveBronder · 2021-07-27T22:38:59Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+    index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
+      assignment_impl(std::tuple_cat(make_assignment_pair(
+          std::get<Is>(results_),
+          std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
+    });


If the input is && can't you always move instead of forward? And I think this can be made a lot simpler with for_each()

Suggested change

index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {

assignment_impl(std::tuple_cat(make_assignment_pair(

std::get<Is>(results_),

std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));

});

for_each([](auto& result, auto&& expr) mutable {

assignment_impl(std::tuple_cat(make_assignment_pair(result, std::move(expr.expressions_))));

}, this->results_, std::move(exprs));

Also can make assignment pair just return back an std::pair?

some make_assignment_pair overloads need to return nothing, which can not be achieved with std::pair. Right now they return an empty tuple.

If the input is && can't you always move instead of forward?

No. The tuple itself might be a rvalue reference, but the elements in it can still be lvalue references, which must not be moved.

I tried to implement this with for_each(), but I can't expand the T_expressions parameter pack, so I can not call the std::forward. This will need to stay as it is (also in this case for_each() would not really simplify the code).

But for that pair, since the rhs is an rvalue doesn't it use the make_assignment_pair() that doesn't call a kernel that you wrote in this PR here? If that test should throw we should check that the result and matrix have the same size before doing the move

You are right about this not being implemented within a kernel. The test, however operates on lvalues, so the new make_assignment_pair() from this PR is not used and kernel would be used if both operations had the same size. As they do not, it correctly throws.

I agree it not exactly nice to have different error handling behavior between l- and rvalues. However, changing the new (rvalue) case to also throw would be a bit hard, as this check would have to be made before the make_assignment_pair() call. That would be a bit more work than I am willing to go into for this PR. Anyway, as it is now in this PR, it works correctly, even if a bit unintuitively.

The test, however operates on lvalues, so the new make_assignment_pair() from this PR is not used and kernel would be used if both operations had the same size. As they do not, it correctly throws.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

EXPECT_THROW(stan::math::results(res1_cl, res2_cl) = stan::math::expressions(m1_cl - m2_cl, m3_cl - m4_cl), std::invalid_argument);

Do we want to always be throwing in the case of incorrect size assignment here? In that case I think it would just be changing the matrix_cl rvalue assign operator to check sizes before doing anything. But imo it feels like it makes sense to allow this since the rhs is replacing the buffer of the matrix_cl.

The other test where we have an expression of a certain size on the lhs still throws

// mismatch in size between an expression and a result that can not be resized auto block1 = stan::math::block_zero_based(m1_cl, 0, 0, 3, 3); EXPECT_THROW(stan::math::results(res1_cl, block1) = stan::math::expressions(m3_cl + m4_cl, m3_cl - m4_cl), std::invalid_argument);

I agree it not exactly nice to have different error handling behavior between l- and rvalues. However, changing the new (rvalue) case to also throw would be a bit hard, as this check would have to be made before the make_assignment_pair() call.

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

What's the intuition behind not wanting to allow resizing in the second test case? tmk Eigen only stops resizing in the case of an Eigen Map or fixed size matrices of mismatched sizes.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

Oh, right, the reason for not using the new code in that test is the fact that the right hand side expression is not a plain matrix. Reminder of my previous post is still true.

Do we want to always be throwing in the case of incorrect size assignment here?

That is debatable. If we have two assignments of different sizes we can not do them in the same kernel so we throw. If one of these can be handled by matrix_cl move assignment (the new code in this PR) than we don't need to. Although we still could to be consistent with throwing when right hand side is a rvalue and we can not use move assignment.

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

We already do that and that is not the issue here. The issue is if we want to assign two expressions in the same call and they have different size (the sizes of results match expressions or can be resized so they do).

What's the intuition behind not wanting to allow resizing in the second test case?

It is not about resizing. See my previous answer.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

Oh, right, the reason for not using the new code in that test is the fact that the right hand side expression is not a plain matrix. Reminder of my previous post is still true.

Sorry which post? And if what I wrote should not be using the new make_assignment_pair() can you add a require to guard against that? Though tbh I'm still confused why it can't use the new one with the move assignment since we are assigning an rvalue expression to the res1/2_cl

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

We already do that and that is not the issue here. The issue is if we want to assign two expressions in the same call and they have different size (the sizes of results match expressions or can be resized so they do).

It sounds like we should have another test that is something like

auto m1_m2_tmp = m1_cl - m2_cl; auto m3_m4_tmp = m1_cl - m2_cl; EXPECT_THROW(stan::math::results(res1_cl, res2_cl) = stan::math::expressions(m1_m2_tmp, m2_m4_tmp), std::invalid_argument);

SteveBronder

Mostly doc questions and the for_each thing

SteveBronder · 2021-08-10T18:04:17Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+    index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
+      assignment_impl(std::tuple_cat(make_assignment_pair(
+          std::get<Is>(results_),
+          std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
+    });


So I got this to work with the below, I think it's doing the same thing? Are the T_expressions in expressions_cl the same value type in the underlying tuple?

stan::math::for_each([](auto&& result, auto&& expr) mutable { using expr_t = decltype(expr); assignment_impl(std::tuple_cat(make_assignment_pair( result, std::forward<expr_t>(expr)))); }, results_, std::forward<decltype(exprs.expressions_)>(exprs.expressions_));

This does throw an error for one of the tests in multi_result_kernel

matrix_cl<double> m1_cl(3, 3); matrix_cl<double> m2_cl(3, 3); matrix_cl<double> m3_cl(4, 3); matrix_cl<double> m4_cl(4, 3); matrix_cl<double> res1_cl; matrix_cl<double> res2_cl; // mismatch in size between different (expression, result) pairs /** * This currently throws but with my above it does not */ EXPECT_THROW(stan::math::results(res1_cl, res2_cl) = stan::math::expressions(m1_cl - m2_cl, m3_cl - m4_cl), std::invalid_argument);

Should that test throw though? The result matrix is not filled so we move the resulting rhs into res*_cl so I think that's fine?

SteveBronder · 2021-08-10T18:05:15Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+    index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
+      assignment_impl(std::tuple_cat(make_assignment_pair(
+          std::get<Is>(results_),
+          std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
+    });


If your good with the above I can open up a PR to change the other ones that use index_apply.

SteveBronder · 2021-08-10T18:07:28Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+          std::get<Is>(results_),
+          std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
+    });
+  }

  /**
   * Incrementing \c results_ object by \c expressions_cl object


Missing docs for the templates. Those are pretty complicated requires so please write out what's going on there

Template is doced on line 339. The require is also relatively simple - just the number of results and expressions must match.

SteveBronder · 2021-08-10T18:08:38Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+          conjunction<
+              is_matrix_cl<T_result>, is_matrix_cl<T_expression>,
+              std::is_same<value_type_t<T_expression>, value_type_t<T_result>>,
+              std::is_rvalue_reference<T_expression&&>>>* = nullptr>


If this is strictly for rvalue references why are we forwarding in the below? And why does this one need to be an rvalue reference?

There are two things that can be r- or l-value references. One is tuple (which in this case must be a rvalue reference and than there are tuple elements that can be either.

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

SteveBronder · 2021-08-17T15:40:44Z

stan/math/opencl/kernel_generator/multi_result_kernel.hpp

+
+  /**
+   * Optimized move assignment of a `matrix_cl` into another `matrix_cl`.
+   * @param result result - check


What does result - check mean?

syclik · 2022-08-18T14:26:05Z

As much as I'd prefer not to lose good work, there were outstanding comments that weren't addressed, so I'm closing this.

t4c1 and others added 2 commits July 27, 2021 14:31

optimization

e52c96c

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

6738d7b

…4.1 (tags/RELEASE_600/final)

SteveBronder reviewed Jul 27, 2021

View reviewed changes

SteveBronder requested changes Aug 10, 2021

View reviewed changes

SteveBronder reviewed Aug 17, 2021

View reviewed changes

syclik closed this Aug 18, 2022

syclik added the wasn't fixed This label indicates the issue wasn't address, but was still closed for book-keeping purposes. label Aug 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimization #2551

optimization #2551

t4c1 commented Jul 27, 2021

stan-buildbot commented Jul 27, 2021

SteveBronder Jul 27, 2021

SteveBronder Jul 28, 2021

t4c1 Jul 28, 2021 •

edited

Loading

t4c1 Jul 28, 2021 •

edited

Loading

t4c1 Jul 28, 2021

SteveBronder Aug 13, 2021

t4c1 Aug 14, 2021

SteveBronder Aug 14, 2021

t4c1 Aug 15, 2021

SteveBronder Aug 17, 2021

SteveBronder left a comment

SteveBronder Aug 10, 2021

SteveBronder Aug 10, 2021

SteveBronder Aug 10, 2021

t4c1 Aug 11, 2021

SteveBronder Aug 10, 2021

t4c1 Aug 11, 2021

SteveBronder Aug 17, 2021

syclik commented Aug 18, 2022

optimization #2551

optimization #2551

Conversation

t4c1 commented Jul 27, 2021

Summary

Tests

Side Effects

Release notes

Checklist

stan-buildbot commented Jul 27, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

t4c1 Jul 28, 2021 • edited Loading

Choose a reason for hiding this comment

t4c1 Jul 28, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SteveBronder left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

syclik commented Aug 18, 2022

t4c1 Jul 28, 2021 •

edited

Loading

t4c1 Jul 28, 2021 •

edited

Loading