-
-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds reduce_sum and tests #1813
Conversation
@bbbales2 it looks like the static and regular tests only differ in calling Also do we want |
I kept them in separate files so that it's easier for someone to find the tests (since we mostly do these things individually). Sort of in the same spirit that there would have been a more compact way to write all the tests we have, but it's just easier to figure out exactly what blew up when they're broken apart. Your call what you wanna do. I felt the same way as you, but just went with copy and paste for the reason above.
There's gonna be no performance hit, and I think regular C++ is easier to read. |
…gs/RELEASE_500/final)
I changed copyright holder to Columbia Uni for Ben, but I'm not sure what the deal is with copyright on things Steve does. Do you know, @SteveBronder? |
I'm not sure but I'd bet it's either me or columbia so as long as columbia is up there it should be fine |
Cool! Will review this soon. Do we have a proper issue for this already? |
Nah, you could make one and point it at the design doc I suppose. Come up with any names other than |
I'm going to put them together in the same file but with separate tests so it's easier to see what broke |
Merging prim and mix's tests for static and normal together deleted like a 1000 lines of code. I think that's a reasonable tradeoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great already. Just cleanups and we need to sort out what we do for the case someone has the odd idea to not turn on STAN_THREADS....
TEST(StanMathRev_reduce_sum, value) { | ||
using stan::math::test::count_lpdf; | ||
using stan::math::test::get_new_msg; | ||
stan::math::init_threadpool_tbb(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The init_threadpool_tbb function requires that STAN_NUM_THREADS is set to really get multiple threads up and running. I would suggest we avoid that and instead do either
- use the global_control thing from the TBB
- default construct the task_scheduler_init
the global control thing would look like:
const size_t parallelism = std::thread::hardware_concurrency();
// total parallelism that TBB can utilize is cut in half for the dynamic extension
// of the given scope, including calls to foo() and bar()
global_control c(global_control::max_allowed_parallelism, parallelism);
the task_scheduler_init init thing would look like
task_scheduler_init default_scheduler;
since the default construction will use all threads... however, the task_scheduler_init object is deprecated since one of the more recent releases of the TBB.
This way we ensure that the tests will run with multiple threads regardless of STAN_NUM_THREADS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the smallest, most self-contained, easy to accidentally understand thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the declaration of the task scheduler init which is deprecated by now...but whatever, we are not going to update the tbb given that the tbb is rarely updated on CRAN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short: Please replace all calls
stan::math::init_threadpool_tbb();
with
tbb::task_scheduler_init default_scheduler;
that will allocate as many CPUs as there are on the machine where we run - regardless of the STAN_NUM_THREADS
environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should all the prim tests and such get these (not all of them had these)? Could I just do one globally in each file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm... sorry... but there is a much simpler solution here: Don't do any initialisation anywhere. That will trigger the default behaviour of the TBB which is to use all cores. So maybe just get rid of init_threadpool_tbb (which is really just there to restrict things to what is set in STAN_NUM_THREADS
) and also do not declare any tbb::task_scheduler_init
. Sounds good?
That's a pretty wild error
I'm gonna break the mix up into two files to see if that helps |
Oh yeah, that file is definitely too big for the Windows g++ 4.9.3. We had this issue in the flatten. I have a windows machine next to me. Want me to split it so you dont stab in the dark? |
I can probably get it, unless you think it's v confusing then I'd be happy for the assist! Is the issue just thatI'm running too much in each test or there are too many tests in each file? |
Too many tests in a file and its really hard to find (guess) where the actual limit is. For that reason we have log_mix_part1_test and part2_test in https://github.com/stan-dev/math/tree/develop/test/unit/math/mix/fun |
hmm, I think we can just cut the file in half and it's fine. lemme give it as shot |
Ugh actually @rok-cesnovar would you be able to give me a hand with this? |
template <typename ReduceFunction, typename ReturnType, typename Vec, | ||
typename... Args> | ||
struct reduce_sum_impl<ReduceFunction, require_var_t<ReturnType>, ReturnType, | ||
Vec, Args...> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am afraid that this specialisation magic is not working right now correctly. I just ran the ode test model with this branch and when using reduce_sum I am not getting any speedup and Stan is also not using multiple threads.
... and a test for this would be great ... not yet sure how to test that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a dumb dumb @bbbales2 I forgot to add reduce_sum.hpp in rev to rev/functor.hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, this can happen - let's just put a test as suggested below to make sure we will never deploy this type of mistake.
Ah. I forgot: Do we have tests in place which check that we get 0 for the case of an empty vmapped argument? We need that for the double and var case. |
I also have R scripts written which runs my example on the back of batchtools. This is an R package which can dispatch the work to a number of backends - including multi-core machines & clusters. Do you want to have a look at that? Only with these R scripts I was able to create comprehensive comparisons. |
…into feature/reduce_sum
Well I added the std::forward calls but I think that was a mistake so I removed them again and then I thought I saw a way to add them again so I added them again. std::forward acts like std::move for rvalues, so I think it's only safe to do one forward after you're done with the variable. In this case the forwards were used on the later functions: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R232 So I just swapped those save_varis calls to be before the recursive_reducer constructor so the recursive_reducer constructor would get to use the forwarding. But still is this really safe @SteveBronder ? Like, we pass things into the constructor of that recursive_reducer, but then if that object gets destructed then the memory attached to those variables (if they were rvalues going in) would get freed. TBB makes copies of the first recursive reducer to get the other reducers, but I don't think we know that it won't deallocate that first recursive reducer before it calls operator() on another one? |
But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object. It would be good if @SteveBronder can comment. I understood that forward just keeps rvalues and lvalues as to what they are; thus I would have expected that we can use forward twice on a given argument, since it just maintains the reference to the object in the form given to the function. |
@wds15 aaaah, good point |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
Sebastian is right here, that first reducer is made and the data is stored then and deleted once it leaves scope. If we did something like return a promise from
Forward treats lvalues as not movable and rvalues as movable. That means for rvalues, if the object has a move constructor defined it will move ownership of the original object's memory and leave the initial object in a valid but unspecified state. That means you only want to forward once you don't need the original object anymore. |
@SteveBronder Then we're good to go? The current version does a single forward per argument: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R223 |
I'll give it all a review rn |
I am running over night the big benchmark once more. If that turns out to look good, then I will approve and merge unless @SteveBronder (or someone else) finds a showstopper. |
@wds15 sooner the better. We should all move on to trying to break @rok-cesnovar 's branch: stan-dev/stanc3#451 (comment) |
All looks good just added a could little things ad02fdf One thing, I added |
Merge when the tests pass @wds15 wooooo |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
Grr... I had to find out that the benchmark crashed over night, as I had grainsize=0 which is now blocked (correctly) from the reduce_sum function. Rerunning it... ~5h from now it should be done... |
Summary
This adds a
reduce_sum
andreduce_sum_static
function to stan math that chunks up a functions arguments and then applies the function with a sum reduction in parallel.Tests
Tests exist for prim, rev, and mix and can be run by first setting
STAN_THREADS
in make/local to true and then callingSide Effects
None
Release notes
Adds a
reduce_sum
functor that performs a parallel map reduce over a function (see design doc here: https://github.com/stan-dev/design-docs/blob/master/designs/0017-reduce_sum.md), and areduce_sum_static
implementation that splits the work in a more deterministic way.Checklist
Math issue "reduce_sum" parallel facility #1815 (also see previous long WIP PR [WIP] Parallel Prototype #1616
Copyright holder: Sebastian Weber, Steve Bronder, Columbia University
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
./runTests.py test/unit
)make test-headers
)make test-math-dependencies
)make doxygen
)make cpplint
)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested