Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds reduce_sum and tests #1813

Merged
merged 50 commits into from
Apr 8, 2020
Merged

adds reduce_sum and tests #1813

merged 50 commits into from
Apr 8, 2020

Conversation

SteveBronder
Copy link
Collaborator

@SteveBronder SteveBronder commented Mar 31, 2020

Summary

This adds a reduce_sum and reduce_sum_static function to stan math that chunks up a functions arguments and then applies the function with a sum reduction in parallel.

Tests

Tests exist for prim, rev, and mix and can be run by first setting STAN_THREADS in make/local to true and then calling

STAN_NUM_THREADS=4 ./runTests.py -j4 ./test/unit/math/ -f reduce_sum

Side Effects

None

Release notes

Adds a reduce_sum functor that performs a parallel map reduce over a function (see design doc here: https://github.com/stan-dev/design-docs/blob/master/designs/0017-reduce_sum.md), and a reduce_sum_static implementation that splits the work in a more deterministic way.

Checklist

  • Math issue "reduce_sum" parallel facility #1815 (also see previous long WIP PR [WIP] Parallel Prototype #1616

  • Copyright holder: Sebastian Weber, Steve Bronder, Columbia University

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@SteveBronder
Copy link
Collaborator Author

@bbbales2 it looks like the static and regular tests only differ in calling reduce_sum_static and reduce_sum, is it okay to just combine the two? That would cut out like 800 lines of code.

Also do we want auto_partitioning to come in as a template parameter? It looks like it's set constant anyway and that will let the compiler clean up the dead code of the if in operator()

@bbbales2
Copy link
Member

@bbbales2 it looks like the static and regular tests only differ in calling reduce_sum_static and reduce_sum, is it okay to just combine the two?

I kept them in separate files so that it's easier for someone to find the tests (since we mostly do these things individually).

Sort of in the same spirit that there would have been a more compact way to write all the tests we have, but it's just easier to figure out exactly what blew up when they're broken apart.

Your call what you wanna do. I felt the same way as you, but just went with copy and paste for the reason above.

Also do we want auto_partitioning to come in as a template parameter?

There's gonna be no performance hit, and I think regular C++ is easier to read.

@bob-carpenter
Copy link
Contributor

I changed copyright holder to Columbia Uni for Ben, but I'm not sure what the deal is with copyright on things Steve does. Do you know, @SteveBronder?

@SteveBronder
Copy link
Collaborator Author

I'm not sure but I'd bet it's either me or columbia so as long as columbia is up there it should be fine

@wds15
Copy link
Contributor

wds15 commented Mar 31, 2020

Cool! Will review this soon. Do we have a proper issue for this already?

@bbbales2
Copy link
Member

Will review this soon. Do we have a proper issue for this already?

Nah, you could make one and point it at the design doc I suppose.

Come up with any names other than reduce_sum_static or reduce_sum_deterministic?

@SteveBronder
Copy link
Collaborator Author

Sort of in the same spirit that there would have been a more compact way to write all the tests we have, but it's just easier to figure out exactly what blew up when they're broken apart.

I'm going to put them together in the same file but with separate tests so it's easier to see what broke

@SteveBronder
Copy link
Collaborator Author

Merging prim and mix's tests for static and normal together deleted like a 1000 lines of code. I think that's a reasonable tradeoff

Copy link
Contributor

@wds15 wds15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great already. Just cleanups and we need to sort out what we do for the case someone has the odd idea to not turn on STAN_THREADS....

stan/math/prim/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
stan/math/prim/functor/reduce_sum.hpp Show resolved Hide resolved
stan/math/prim/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
stan/math/prim/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
stan/math/prim/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
stan/math/rev/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
stan/math/rev/functor/reduce_sum.hpp Outdated Show resolved Hide resolved
test/unit/math/rev/functor/reduce_sum_test.cpp Outdated Show resolved Hide resolved
TEST(StanMathRev_reduce_sum, value) {
using stan::math::test::count_lpdf;
using stan::math::test::get_new_msg;
stan::math::init_threadpool_tbb();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The init_threadpool_tbb function requires that STAN_NUM_THREADS is set to really get multiple threads up and running. I would suggest we avoid that and instead do either

  • use the global_control thing from the TBB
  • default construct the task_scheduler_init

the global control thing would look like:

        const size_t parallelism = std::thread::hardware_concurrency();
        // total parallelism that TBB can utilize is cut in half for the dynamic extension
        // of the given scope, including calls to foo() and bar()
        global_control c(global_control::max_allowed_parallelism, parallelism);

the task_scheduler_init init thing would look like

task_scheduler_init default_scheduler;

since the default construction will use all threads... however, the task_scheduler_init object is deprecated since one of the more recent releases of the TBB.

This way we ensure that the tests will run with multiple threads regardless of STAN_NUM_THREADS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the smallest, most self-contained, easy to accidentally understand thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the declaration of the task scheduler init which is deprecated by now...but whatever, we are not going to update the tbb given that the tbb is rarely updated on CRAN.

Copy link
Contributor

@wds15 wds15 Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In short: Please replace all calls

stan::math::init_threadpool_tbb();

with

tbb::task_scheduler_init default_scheduler;

that will allocate as many CPUs as there are on the machine where we run - regardless of the STAN_NUM_THREADS environment variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all the prim tests and such get these (not all of them had these)? Could I just do one globally in each file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm... sorry... but there is a much simpler solution here: Don't do any initialisation anywhere. That will trigger the default behaviour of the TBB which is to use all cores. So maybe just get rid of init_threadpool_tbb (which is really just there to restrict things to what is set in STAN_NUM_THREADS) and also do not declare any tbb::task_scheduler_init. Sounds good?

stan/math/rev/functor/reduce_sum.hpp Show resolved Hide resolved
@SteveBronder
Copy link
Collaborator Author

SteveBronder commented Apr 1, 2020

That's a pretty wild error

g++ -std=c++1y -m64 -D_REENTRANT -Wall -Wno-unused-function -Wno-uninitialized -Wno-unused-but-set-variable -Wno-unused-variable -Wno-sign-compare -Wno-unused-local-typedefs -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.3 -I lib/boost_1.72.0 -I lib/sundials_5.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -D_USE_MATH_DEFINES -DBOOST_DISABLE_ASSERTS -Wl,-L,"C:/Jenkins/workspace/Math_Pipeline_PR-1813/lib/tbb" -Wl,-rpath,"C:/Jenkins/workspace/Math_Pipeline_PR-1813/lib/tbb" test/unit/math/mix/meta/scalar_type_test.o lib/gtest_1.8.1/src/gtest_main.cc lib/gtest_1.8.1/src/gtest-all.o lib/tbb/tbb.dll -static-libgcc -static-libstdc++ -o test/unit/math/mix/meta/scalar_type_test.exe

test/unit/math/mix/functor/reduce_sum_test.cpp:481:1: internal compiler error: Segmentation fault
}
^
Please submit a full bug report,
with preprocessed source if appropriate.
See http://sourceforge.net/projects/mingw-w64 for instructions.

I'm gonna break the mix up into two files to see if that helps

@rok-cesnovar
Copy link
Member

Oh yeah, that file is definitely too big for the Windows g++ 4.9.3. We had this issue in the flatten.

I have a windows machine next to me. Want me to split it so you dont stab in the dark?

@SteveBronder
Copy link
Collaborator Author

I can probably get it, unless you think it's v confusing then I'd be happy for the assist! Is the issue just thatI'm running too much in each test or there are too many tests in each file?

@rok-cesnovar
Copy link
Member

Too many tests in a file and its really hard to find (guess) where the actual limit is.

For that reason we have log_mix_part1_test and part2_test in https://github.com/stan-dev/math/tree/develop/test/unit/math/mix/fun

@SteveBronder
Copy link
Collaborator Author

hmm, I think we can just cut the file in half and it's fine. lemme give it as shot

@SteveBronder
Copy link
Collaborator Author

Ugh actually @rok-cesnovar would you be able to give me a hand with this?

template <typename ReduceFunction, typename ReturnType, typename Vec,
typename... Args>
struct reduce_sum_impl<ReduceFunction, require_var_t<ReturnType>, ReturnType,
Vec, Args...> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid that this specialisation magic is not working right now correctly. I just ran the ode test model with this branch and when using reduce_sum I am not getting any speedup and Stan is also not using multiple threads.

... and a test for this would be great ... not yet sure how to test that though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a dumb dumb @bbbales2 I forgot to add reduce_sum.hpp in rev to rev/functor.hpp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, this can happen - let's just put a test as suggested below to make sure we will never deploy this type of mistake.

@wds15
Copy link
Contributor

wds15 commented Apr 2, 2020

Ah. I forgot: Do we have tests in place which check that we get 0 for the case of an empty vmapped argument? We need that for the double and var case.

@wds15
Copy link
Contributor

wds15 commented Apr 7, 2020

I also have R scripts written which runs my example on the back of batchtools. This is an R package which can dispatch the work to a number of backends - including multi-core machines & clusters. Do you want to have a look at that? Only with these R scripts I was able to create comprehensive comparisons.

@bbbales2
Copy link
Member

bbbales2 commented Apr 7, 2020

Well I added the std::forward calls but I think that was a mistake so I removed them again and then I thought I saw a way to add them again so I added them again.

std::forward acts like std::move for rvalues, so I think it's only safe to do one forward after you're done with the variable. In this case the forwards were used on the later functions: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R232

So I just swapped those save_varis calls to be before the recursive_reducer constructor so the recursive_reducer constructor would get to use the forwarding.

But still is this really safe @SteveBronder ? Like, we pass things into the constructor of that recursive_reducer, but then if that object gets destructed then the memory attached to those variables (if they were rvalues going in) would get freed.

TBB makes copies of the first recursive reducer to get the other reducers, but I don't think we know that it won't deallocate that first recursive reducer before it calls operator() on another one?

@wds15
Copy link
Contributor

wds15 commented Apr 7, 2020

TBB makes copies of the first recursive reducer to get the other reducers, but I don't think we know that it won't deallocate that first recursive reducer before it calls operator() on another one?

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

It would be good if @SteveBronder can comment. I understood that forward just keeps rvalues and lvalues as to what they are; thus I would have expected that we can use forward twice on a given argument, since it just maintains the reference to the object in the form given to the function.

@bbbales2
Copy link
Member

bbbales2 commented Apr 7, 2020

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

@wds15 aaaah, good point

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 4.83 4.8 1.01 0.78% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.01 0.8% faster
eight_schools/eight_schools.stan 0.09 0.09 0.99 -1.24% slower
gp_regr/gp_regr.stan 0.22 0.22 1.0 0.18% faster
irt_2pl/irt_2pl.stan 6.45 6.49 0.99 -0.61% slower
performance.compilation 88.26 87.16 1.01 1.24% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 7.51 7.55 0.99 -0.5% slower
pkpd/one_comp_mm_elim_abs.stan 20.49 20.57 1.0 -0.41% slower
sir/sir.stan 92.33 94.18 0.98 -2.0% slower
gp_regr/gen_gp_data.stan 0.05 0.05 0.98 -2.3% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.96 1.0 0.23% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.31 0.3 1.01 0.61% faster
arK/arK.stan 1.77 1.74 1.02 1.69% faster
arma/arma.stan 0.65 0.66 0.99 -0.94% slower
garch/garch.stan 0.51 0.51 1.0 0.29% faster
Mean result: 0.998665440316

Jenkins Console Log
Blue Ocean
Commit hash: 5bd7f91


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Copy link
Collaborator Author

But still is this really safe @SteveBronder ? Like, we pass things into the constructor of that recursive_reducer, but then if that object gets destructed then the memory attached to those variables (if they were rvalues going in) would get freed.

Sebastian is right here, that first reducer is made and the data is stored then and deleted once it leaves scope. If we did something like return a promise from reduce_sum_impl::operator() then it would get weird but the same is true for any reference type we used.

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

It would be good if @SteveBronder can comment. I understood that forward just keeps rvalues and lvalues as to what they are; thus I would have expected that we can use forward twice on a given argument, since it just maintains the reference to the object in the form given to the function.

Forward treats lvalues as not movable and rvalues as movable. That means for rvalues, if the object has a move constructor defined it will move ownership of the original object's memory and leave the initial object in a valid but unspecified state. That means you only want to forward once you don't need the original object anymore.

@bbbales2
Copy link
Member

bbbales2 commented Apr 7, 2020

@SteveBronder Then we're good to go? The current version does a single forward per argument: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R223

@SteveBronder
Copy link
Collaborator Author

I'll give it all a review rn

@wds15
Copy link
Contributor

wds15 commented Apr 7, 2020

I am running over night the big benchmark once more. If that turns out to look good, then I will approve and merge unless @SteveBronder (or someone else) finds a showstopper.

@bbbales2
Copy link
Member

bbbales2 commented Apr 7, 2020

@wds15 sooner the better. We should all move on to trying to break @rok-cesnovar 's branch: stan-dev/stanc3#451 (comment)

@SteveBronder
Copy link
Collaborator Author

SteveBronder commented Apr 7, 2020

All looks good just added a could little things ad02fdf

One thing, I added export STAN_NUM_THREADS=4 to the jenkinsfile or linux (accidentally added it for windows but removed it.

@bbbales2
Copy link
Member

bbbales2 commented Apr 7, 2020

Merge when the tests pass @wds15 wooooo

@stan-buildbot
Copy link
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 4.93 4.94 1.0 -0.19% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -2.51% slower
eight_schools/eight_schools.stan 0.09 0.09 1.0 0.16% faster
gp_regr/gp_regr.stan 0.22 0.22 1.0 0.1% faster
irt_2pl/irt_2pl.stan 6.44 6.45 1.0 -0.14% slower
performance.compilation 88.17 87.15 1.01 1.15% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 7.52 7.53 1.0 -0.02% slower
pkpd/one_comp_mm_elim_abs.stan 20.32 20.64 0.98 -1.55% slower
sir/sir.stan 90.91 96.1 0.95 -5.71% slower
gp_regr/gen_gp_data.stan 0.05 0.05 0.96 -4.66% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 2.95 1.0 0.26% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.31 0.3 1.01 1.01% faster
arK/arK.stan 1.73 1.75 0.99 -0.74% slower
arma/arma.stan 0.66 0.66 1.0 -0.36% slower
garch/garch.stan 0.51 0.51 1.0 0.23% faster
Mean result: 0.991777537571

Jenkins Console Log
Blue Ocean
Commit hash: 8d8ebd1


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@wds15
Copy link
Contributor

wds15 commented Apr 8, 2020

Grr... I had to find out that the benchmark crashed over night, as I had grainsize=0 which is now blocked (correctly) from the reduce_sum function. Rerunning it... ~5h from now it should be done...

@wds15
Copy link
Contributor

wds15 commented Apr 8, 2020

Well... our cluster is apparently busy and I am not getting results as fast as I would like to. Below is a comparison of map_rect and reduce_sum. To me this looks all good. No need to wait any longer (seems like people want to submit large chunks of stuff before the holidays):

ode-bench-elapsed-alt

Approving...

@wds15 wds15 merged commit f02c6df into develop Apr 8, 2020
@rok-cesnovar rok-cesnovar deleted the feature/reduce_sum branch May 3, 2020 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants