Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimization #2551

Closed
wants to merge 2 commits into from
Closed

optimization #2551

wants to merge 2 commits into from

Conversation

t4c1
Copy link
Contributor

@t4c1 t4c1 commented Jul 27, 2021

Summary

Optimize kernel generator so it can use matrix_cl's move assignment where possible instead of copying the data.

Tests

Added a test to check that the new optimization is used.

Side Effects

None.

Release notes

OpenCL: Optimized kernel generator so it can use matrix_cl's move assignment where possible instead of copying the data.

Checklist

  • Math issue #(issue number)

  • Copyright holder: Tadej Ciglarič

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 3.14 3.03 1.04 3.45% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.97 -3.07% slower
eight_schools/eight_schools.stan 0.11 0.11 1.05 5.04% faster
gp_regr/gp_regr.stan 0.16 0.16 1.01 1.14% faster
irt_2pl/irt_2pl.stan 5.9 5.82 1.01 1.41% faster
performance.compilation 87.82 87.27 1.01 0.62% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.67 8.55 1.01 1.29% faster
pkpd/one_comp_mm_elim_abs.stan 29.96 29.69 1.01 0.91% faster
sir/sir.stan 128.18 130.53 0.98 -1.83% slower
gp_regr/gen_gp_data.stan 0.03 0.03 1.0 0.42% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.01 2.99 1.01 0.97% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.4 0.39 1.02 2.2% faster
arK/arK.stan 1.88 1.88 1.0 0.09% faster
arma/arma.stan 0.93 0.82 1.12 10.97% faster
garch/garch.stan 0.63 0.53 1.19 15.83% faster
Mean result: 1.02960269611

Jenkins Console Log
Blue Ocean
Commit hash: 6738d7b


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Comment on lines +328 to +332
index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
assignment_impl(std::tuple_cat(make_assignment_pair(
std::get<Is>(results_),
std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is && can't you always move instead of forward? And I think this can be made a lot simpler with for_each()

Suggested change
index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
assignment_impl(std::tuple_cat(make_assignment_pair(
std::get<Is>(results_),
std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
});
for_each([](auto& result, auto&& expr) mutable {
assignment_impl(std::tuple_cat(make_assignment_pair(result, std::move(expr.expressions_))));
}, this->results_, std::move(exprs));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can make assignment pair just return back an std::pair?

Copy link
Contributor Author

@t4c1 t4c1 Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some make_assignment_pair overloads need to return nothing, which can not be achieved with std::pair. Right now they return an empty tuple.

Copy link
Contributor Author

@t4c1 t4c1 Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is && can't you always move instead of forward?

No. The tuple itself might be a rvalue reference, but the elements in it can still be lvalue references, which must not be moved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to implement this with for_each(), but I can't expand the T_expressions parameter pack, so I can not call the std::forward. This will need to stay as it is (also in this case for_each() would not really simplify the code).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for that pair, since the rhs is an rvalue doesn't it use the make_assignment_pair() that doesn't call a kernel that you wrote in this PR here? If that test should throw we should check that the result and matrix have the same size before doing the move

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about this not being implemented within a kernel. The test, however operates on lvalues, so the new make_assignment_pair() from this PR is not used and kernel would be used if both operations had the same size. As they do not, it correctly throws.

I agree it not exactly nice to have different error handling behavior between l- and rvalues. However, changing the new (rvalue) case to also throw would be a bit hard, as this check would have to be made before the make_assignment_pair() call. That would be a bit more work than I am willing to go into for this PR. Anyway, as it is now in this PR, it works correctly, even if a bit unintuitively.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test, however operates on lvalues, so the new make_assignment_pair() from this PR is not used and kernel would be used if both operations had the same size. As they do not, it correctly throws.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

  EXPECT_THROW(stan::math::results(res1_cl, res2_cl)
               = stan::math::expressions(m1_cl - m2_cl, m3_cl - m4_cl),
               std::invalid_argument);

Do we want to always be throwing in the case of incorrect size assignment here? In that case I think it would just be changing the matrix_cl rvalue assign operator to check sizes before doing anything. But imo it feels like it makes sense to allow this since the rhs is replacing the buffer of the matrix_cl.

The other test where we have an expression of a certain size on the lhs still throws

  // mismatch in size between an expression and a result that can not be resized
  auto block1 = stan::math::block_zero_based(m1_cl, 0, 0, 3, 3);
  EXPECT_THROW(stan::math::results(res1_cl, block1)
               = stan::math::expressions(m3_cl + m4_cl, m3_cl - m4_cl),
               std::invalid_argument);

I agree it not exactly nice to have different error handling behavior between l- and rvalues. However, changing the new (rvalue) case to also throw would be a bit hard, as this check would have to be made before the make_assignment_pair() call.

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

What's the intuition behind not wanting to allow resizing in the second test case? tmk Eigen only stops resizing in the case of an Eigen Map or fixed size matrices of mismatched sizes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

Oh, right, the reason for not using the new code in that test is the fact that the right hand side expression is not a plain matrix. Reminder of my previous post is still true.

Do we want to always be throwing in the case of incorrect size assignment here?

That is debatable. If we have two assignments of different sizes we can not do them in the same kernel so we throw. If one of these can be handled by matrix_cl move assignment (the new code in this PR) than we don't need to. Although we still could to be consistent with throwing when right hand side is a rvalue and we can not use move assignment.

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

We already do that and that is not the issue here. The issue is if we want to assign two expressions in the same call and they have different size (the sizes of results match expressions or can be resized so they do).

What's the intuition behind not wanting to allow resizing in the second test case?

It is not about resizing. See my previous answer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m1_cl - m2_cl in the test we are talking about is an rvalue though and we are assigning it to a matrix_cl<> that is uninitialized.

Oh, right, the reason for not using the new code in that test is the fact that the right hand side expression is not a plain matrix. Reminder of my previous post is still true.

Sorry which post? And if what I wrote should not be using the new make_assignment_pair() can you add a require to guard against that? Though tbh I'm still confused why it can't use the new one with the move assignment since we are assigning an rvalue expression to the res1/2_cl

Is there a reason we can't just check the rows and columns of the assigner and assignee and if they mismatch throw an error?

We already do that and that is not the issue here. The issue is if we want to assign two expressions in the same call and they have different size (the sizes of results match expressions or can be resized so they do).

It sounds like we should have another test that is something like

  auto m1_m2_tmp = m1_cl - m2_cl;
  auto m3_m4_tmp = m1_cl - m2_cl;
  EXPECT_THROW(stan::math::results(res1_cl, res2_cl)
               = stan::math::expressions(m1_m2_tmp, m2_m4_tmp),
               std::invalid_argument);

Copy link
Collaborator

@SteveBronder SteveBronder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly doc questions and the for_each thing

Comment on lines +328 to +332
index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
assignment_impl(std::tuple_cat(make_assignment_pair(
std::get<Is>(results_),
std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I got this to work with the below, I think it's doing the same thing? Are the T_expressions in expressions_cl the same value type in the underlying tuple?

    stan::math::for_each([](auto&& result, auto&& expr) mutable {
      using expr_t = decltype(expr);
      assignment_impl(std::tuple_cat(make_assignment_pair(
          result,
          std::forward<expr_t>(expr))));

    }, results_, std::forward<decltype(exprs.expressions_)>(exprs.expressions_));

This does throw an error for one of the tests in multi_result_kernel

  matrix_cl<double> m1_cl(3, 3);
  matrix_cl<double> m2_cl(3, 3);
  matrix_cl<double> m3_cl(4, 3);
  matrix_cl<double> m4_cl(4, 3);

  matrix_cl<double> res1_cl;
  matrix_cl<double> res2_cl;
  // mismatch in size between different (expression, result) pairs
  /** 
   * This currently throws but with my above it does not
   */
  EXPECT_THROW(stan::math::results(res1_cl, res2_cl)
               = stan::math::expressions(m1_cl - m2_cl, m3_cl - m4_cl),
               std::invalid_argument);

Should that test throw though? The result matrix is not filled so we move the resulting rhs into res*_cl so I think that's fine?

Comment on lines +328 to +332
index_apply<sizeof...(T_expressions)>([this, &exprs](auto... Is) mutable {
assignment_impl(std::tuple_cat(make_assignment_pair(
std::get<Is>(results_),
std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your good with the above I can open up a PR to change the other ones that use index_apply.

std::get<Is>(results_),
std::forward<T_expressions>(std::get<Is>(exprs.expressions_)))...));
});
}

/**
* Incrementing \c results_ object by \c expressions_cl object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing docs for the templates. Those are pretty complicated requires so please write out what's going on there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Template is doced on line 339. The require is also relatively simple - just the number of results and expressions must match.

conjunction<
is_matrix_cl<T_result>, is_matrix_cl<T_expression>,
std::is_same<value_type_t<T_expression>, value_type_t<T_result>>,
std::is_rvalue_reference<T_expression&&>>>* = nullptr>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is strictly for rvalue references why are we forwarding in the below? And why does this one need to be an rvalue reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things that can be r- or l-value references. One is tuple (which in this case must be a rvalue reference and than there are tuple elements that can be either.


/**
* Optimized move assignment of a `matrix_cl` into another `matrix_cl`.
* @param result result - check
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does result - check mean?

@syclik
Copy link
Member

syclik commented Aug 18, 2022

As much as I'd prefer not to lose good work, there were outstanding comments that weren't addressed, so I'm closing this.

@syclik syclik closed this Aug 18, 2022
@syclik syclik added the wasn't fixed This label indicates the issue wasn't address, but was still closed for book-keeping purposes. label Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasn't fixed This label indicates the issue wasn't address, but was still closed for book-keeping purposes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants