adds reduce_sum and tests #1813

SteveBronder · 2020-03-31T18:17:15Z

Summary

This adds a reduce_sum and reduce_sum_static function to stan math that chunks up a functions arguments and then applies the function with a sum reduction in parallel.

Tests

Tests exist for prim, rev, and mix and can be run by first setting STAN_THREADS in make/local to true and then calling

STAN_NUM_THREADS=4 ./runTests.py -j4 ./test/unit/math/ -f reduce_sum

Side Effects

None

Release notes

Adds a reduce_sum functor that performs a parallel map reduce over a function (see design doc here: https://github.com/stan-dev/design-docs/blob/master/designs/0017-reduce_sum.md), and a reduce_sum_static implementation that splits the work in a more deterministic way.

Checklist

Math issue "reduce_sum" parallel facility #1815 (also see previous long WIP PR [WIP] Parallel Prototype #1616
Copyright holder: Sebastian Weber, Steve Bronder, Columbia University

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

SteveBronder · 2020-03-31T18:20:10Z

@bbbales2 it looks like the static and regular tests only differ in calling reduce_sum_static and reduce_sum, is it okay to just combine the two? That would cut out like 800 lines of code.

Also do we want auto_partitioning to come in as a template parameter? It looks like it's set constant anyway and that will let the compiler clean up the dead code of the if in operator()

bbbales2 · 2020-03-31T18:30:59Z

@bbbales2 it looks like the static and regular tests only differ in calling reduce_sum_static and reduce_sum, is it okay to just combine the two?

I kept them in separate files so that it's easier for someone to find the tests (since we mostly do these things individually).

Sort of in the same spirit that there would have been a more compact way to write all the tests we have, but it's just easier to figure out exactly what blew up when they're broken apart.

Your call what you wanna do. I felt the same way as you, but just went with copy and paste for the reason above.

Also do we want auto_partitioning to come in as a template parameter?

There's gonna be no performance hit, and I think regular C++ is easier to read.

…gs/RELEASE_500/final)

bob-carpenter · 2020-03-31T19:42:39Z

I changed copyright holder to Columbia Uni for Ben, but I'm not sure what the deal is with copyright on things Steve does. Do you know, @SteveBronder?

SteveBronder · 2020-03-31T20:36:41Z

I'm not sure but I'd bet it's either me or columbia so as long as columbia is up there it should be fine

wds15 · 2020-03-31T20:53:41Z

Cool! Will review this soon. Do we have a proper issue for this already?

bbbales2 · 2020-03-31T22:23:36Z

Will review this soon. Do we have a proper issue for this already?

Nah, you could make one and point it at the design doc I suppose.

Come up with any names other than reduce_sum_static or reduce_sum_deterministic?

SteveBronder · 2020-03-31T23:01:08Z

Sort of in the same spirit that there would have been a more compact way to write all the tests we have, but it's just easier to figure out exactly what blew up when they're broken apart.

I'm going to put them together in the same file but with separate tests so it's easier to see what broke

…stable/2017-11-14)

SteveBronder · 2020-04-01T03:33:37Z

Merging prim and mix's tests for static and normal together deleted like a 1000 lines of code. I think that's a reasonable tradeoff

…stable/2017-11-14)

wds15

Looking great already. Just cleanups and we need to sort out what we do for the case someone has the odd idea to not turn on STAN_THREADS....

stan/math/prim/functor/reduce_sum.hpp

stan/math/rev/functor/reduce_sum.hpp

test/unit/math/rev/functor/reduce_sum_test.cpp

wds15 · 2020-04-01T09:36:11Z

test/unit/math/rev/functor/reduce_sum_test.cpp

+TEST(StanMathRev_reduce_sum, value) {
+  using stan::math::test::count_lpdf;
+  using stan::math::test::get_new_msg;
+  stan::math::init_threadpool_tbb();


The init_threadpool_tbb function requires that STAN_NUM_THREADS is set to really get multiple threads up and running. I would suggest we avoid that and instead do either

use the global_control thing from the TBB

default construct the task_scheduler_init

the global control thing would look like:

const size_t parallelism = std::thread::hardware_concurrency(); // total parallelism that TBB can utilize is cut in half for the dynamic extension // of the given scope, including calls to foo() and bar() global_control c(global_control::max_allowed_parallelism, parallelism);

the task_scheduler_init init thing would look like

task_scheduler_init default_scheduler;

since the default construction will use all threads... however, the task_scheduler_init object is deprecated since one of the more recent releases of the TBB.

This way we ensure that the tests will run with multiple threads regardless of STAN_NUM_THREADS.

What is the smallest, most self-contained, easy to accidentally understand thing?

Probably the declaration of the task scheduler init which is deprecated by now...but whatever, we are not going to update the tbb given that the tbb is rarely updated on CRAN.

In short: Please replace all calls

stan::math::init_threadpool_tbb();

with

tbb::task_scheduler_init default_scheduler;

that will allocate as many CPUs as there are on the machine where we run - regardless of the STAN_NUM_THREADS environment variable.

Should all the prim tests and such get these (not all of them had these)? Could I just do one globally in each file?

Uhm... sorry... but there is a much simpler solution here: Don't do any initialisation anywhere. That will trigger the default behaviour of the TBB which is to use all cores. So maybe just get rid of init_threadpool_tbb (which is really just there to restrict things to what is set in STAN_NUM_THREADS) and also do not declare any tbb::task_scheduler_init. Sounds good?

stan/math/rev/functor/reduce_sum.hpp

SteveBronder · 2020-04-01T16:21:37Z

That's a pretty wild error

g++ -std=c++1y -m64 -D_REENTRANT -Wall -Wno-unused-function -Wno-uninitialized -Wno-unused-but-set-variable -Wno-unused-variable -Wno-sign-compare -Wno-unused-local-typedefs -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.3 -I lib/boost_1.72.0 -I lib/sundials_5.1.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -D_USE_MATH_DEFINES -DBOOST_DISABLE_ASSERTS -Wl,-L,"C:/Jenkins/workspace/Math_Pipeline_PR-1813/lib/tbb" -Wl,-rpath,"C:/Jenkins/workspace/Math_Pipeline_PR-1813/lib/tbb" test/unit/math/mix/meta/scalar_type_test.o lib/gtest_1.8.1/src/gtest_main.cc lib/gtest_1.8.1/src/gtest-all.o lib/tbb/tbb.dll -static-libgcc -static-libstdc++ -o test/unit/math/mix/meta/scalar_type_test.exe

test/unit/math/mix/functor/reduce_sum_test.cpp:481:1: internal compiler error: Segmentation fault
}
^
Please submit a full bug report,
with preprocessed source if appropriate.
See http://sourceforge.net/projects/mingw-w64 for instructions.

I'm gonna break the mix up into two files to see if that helps

rok-cesnovar · 2020-04-01T16:26:16Z

Oh yeah, that file is definitely too big for the Windows g++ 4.9.3. We had this issue in the flatten.

I have a windows machine next to me. Want me to split it so you dont stab in the dark?

SteveBronder · 2020-04-01T16:48:02Z

I can probably get it, unless you think it's v confusing then I'd be happy for the assist! Is the issue just thatI'm running too much in each test or there are too many tests in each file?

rok-cesnovar · 2020-04-01T16:52:18Z

Too many tests in a file and its really hard to find (guess) where the actual limit is.

For that reason we have log_mix_part1_test and part2_test in https://github.com/stan-dev/math/tree/develop/test/unit/math/mix/fun

SteveBronder · 2020-04-01T17:01:03Z

hmm, I think we can just cut the file in half and it's fine. lemme give it as shot

SteveBronder · 2020-04-01T21:25:14Z

Ugh actually @rok-cesnovar would you be able to give me a hand with this?

wds15 · 2020-04-02T12:55:11Z

stan/math/rev/functor/reduce_sum.hpp

+template <typename ReduceFunction, typename ReturnType, typename Vec,
+          typename... Args>
+struct reduce_sum_impl<ReduceFunction, require_var_t<ReturnType>, ReturnType,
+                       Vec, Args...> {


I am afraid that this specialisation magic is not working right now correctly. I just ran the ode test model with this branch and when using reduce_sum I am not getting any speedup and Stan is also not using multiple threads.

... and a test for this would be great ... not yet sure how to test that though.

I'm a dumb dumb @bbbales2 I forgot to add reduce_sum.hpp in rev to rev/functor.hpp

No worries, this can happen - let's just put a test as suggested below to make sure we will never deploy this type of mistake.

wds15 · 2020-04-02T13:06:55Z

Ah. I forgot: Do we have tests in place which check that we get 0 for the case of an empty vmapped argument? We need that for the double and var case.

wds15 · 2020-04-07T09:30:24Z

I also have R scripts written which runs my example on the back of batchtools. This is an R package which can dispatch the work to a number of backends - including multi-core machines & clusters. Do you want to have a look at that? Only with these R scripts I was able to create comprehensive comparisons.

…1815)" This reverts commit 7f76760.

…into feature/reduce_sum

)

bbbales2 · 2020-04-07T12:07:23Z

Well I added the std::forward calls but I think that was a mistake so I removed them again and then I thought I saw a way to add them again so I added them again.

std::forward acts like std::move for rvalues, so I think it's only safe to do one forward after you're done with the variable. In this case the forwards were used on the later functions: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R232

So I just swapped those save_varis calls to be before the recursive_reducer constructor so the recursive_reducer constructor would get to use the forwarding.

But still is this really safe @SteveBronder ? Like, we pass things into the constructor of that recursive_reducer, but then if that object gets destructed then the memory attached to those variables (if they were rvalues going in) would get freed.

TBB makes copies of the first recursive reducer to get the other reducers, but I don't think we know that it won't deallocate that first recursive reducer before it calls operator() on another one?

wds15 · 2020-04-07T12:25:27Z

TBB makes copies of the first recursive reducer to get the other reducers, but I don't think we know that it won't deallocate that first recursive reducer before it calls operator() on another one?

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

It would be good if @SteveBronder can comment. I understood that forward just keeps rvalues and lvalues as to what they are; thus I would have expected that we can use forward twice on a given argument, since it just maintains the reference to the object in the form given to the function.

bbbales2 · 2020-04-07T12:27:48Z

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

@wds15 aaaah, good point

stan-buildbot · 2020-04-07T17:48:01Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.83	4.8	1.01	0.78% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	1.01	0.8% faster
eight_schools/eight_schools.stan	0.09	0.09	0.99	-1.24% slower
gp_regr/gp_regr.stan	0.22	0.22	1.0	0.18% faster
irt_2pl/irt_2pl.stan	6.45	6.49	0.99	-0.61% slower
performance.compilation	88.26	87.16	1.01	1.24% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.51	7.55	0.99	-0.5% slower
pkpd/one_comp_mm_elim_abs.stan	20.49	20.57	1.0	-0.41% slower
sir/sir.stan	92.33	94.18	0.98	-2.0% slower
gp_regr/gen_gp_data.stan	0.05	0.05	0.98	-2.3% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.96	2.96	1.0	0.23% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.3	1.01	0.61% faster
arK/arK.stan	1.77	1.74	1.02	1.69% faster
arma/arma.stan	0.65	0.66	0.99	-0.94% slower
garch/garch.stan	0.51	0.51	1.0	0.29% faster
Mean result: 0.998665440316

Jenkins Console Log
Blue Ocean
Commit hash: 5bd7f91

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

SteveBronder · 2020-04-07T19:47:31Z

But still is this really safe @SteveBronder ? Like, we pass things into the constructor of that recursive_reducer, but then if that object gets destructed then the memory attached to those variables (if they were rvalues going in) would get freed.

Sebastian is right here, that first reducer is made and the data is stored then and deleted once it leaves scope. If we did something like return a promise from reduce_sum_impl::operator() then it would get weird but the same is true for any reference type we used.

But we are constructing a recursive_reducer ourselves and give that to the TBB reduce. Thus, recursive_reducer which we feed with the forwarded stuff will always be in scope and will not be deallocated during the life-time of the use of this object.

It would be good if @SteveBronder can comment. I understood that forward just keeps rvalues and lvalues as to what they are; thus I would have expected that we can use forward twice on a given argument, since it just maintains the reference to the object in the form given to the function.

Forward treats lvalues as not movable and rvalues as movable. That means for rvalues, if the object has a move constructor defined it will move ownership of the original object's memory and leave the initial object in a valid but unspecified state. That means you only want to forward once you don't need the original object anymore.

bbbales2 · 2020-04-07T19:50:29Z

@SteveBronder Then we're good to go? The current version does a single forward per argument: https://github.com/stan-dev/math/pull/1813/files#diff-8283b21f45f7557cada3e4da4bc636b6R223

SteveBronder · 2020-04-07T19:53:52Z

I'll give it all a review rn

wds15 · 2020-04-07T21:59:03Z

I am running over night the big benchmark once more. If that turns out to look good, then I will approve and merge unless @SteveBronder (or someone else) finds a showstopper.

bbbales2 · 2020-04-07T22:12:54Z

@wds15 sooner the better. We should all move on to trying to break @rok-cesnovar 's branch: stan-dev/stanc3#451 (comment)

…ature/reduce_sum

SteveBronder · 2020-04-07T22:58:34Z

All looks good just added a could little things ad02fdf

One thing, I added export STAN_NUM_THREADS=4 to the jenkinsfile or linux (accidentally added it for windows but removed it.

bbbales2 · 2020-04-07T23:20:20Z

Merge when the tests pass @wds15 wooooo

stan-buildbot · 2020-04-08T05:00:07Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.93	4.94	1.0	-0.19% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.98	-2.51% slower
eight_schools/eight_schools.stan	0.09	0.09	1.0	0.16% faster
gp_regr/gp_regr.stan	0.22	0.22	1.0	0.1% faster
irt_2pl/irt_2pl.stan	6.44	6.45	1.0	-0.14% slower
performance.compilation	88.17	87.15	1.01	1.15% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.52	7.53	1.0	-0.02% slower
pkpd/one_comp_mm_elim_abs.stan	20.32	20.64	0.98	-1.55% slower
sir/sir.stan	90.91	96.1	0.95	-5.71% slower
gp_regr/gen_gp_data.stan	0.05	0.05	0.96	-4.66% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.95	2.95	1.0	0.26% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.3	1.01	1.01% faster
arK/arK.stan	1.73	1.75	0.99	-0.74% slower
arma/arma.stan	0.66	0.66	1.0	-0.36% slower
garch/garch.stan	0.51	0.51	1.0	0.23% faster
Mean result: 0.991777537571

Jenkins Console Log
Blue Ocean
Commit hash: 8d8ebd1

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

wds15 · 2020-04-08T08:08:23Z

Grr... I had to find out that the benchmark crashed over night, as I had grainsize=0 which is now blocked (correctly) from the reduce_sum function. Rerunning it... ~5h from now it should be done...

wds15 · 2020-04-08T12:46:31Z

Well... our cluster is apparently busy and I am not getting results as fast as I would like to. Below is a comparison of map_rect and reduce_sum. To me this looks all good. No need to wait any longer (seems like people want to submit large chunks of stuff before the holidays):

Approving...

SteveBronder added 2 commits March 31, 2020 14:10

adds reduce_sum and tests

7fe241e

remove stan::math::test namespace for now

ec78eb6

[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…

65e6526

…gs/RELEASE_500/final)

cpplint

928898b

SteveBronder and others added 2 commits March 31, 2020 23:31

merge static and normal reduce_sum tests for mix and prim

9d48fab

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…

c554db8

…stable/2017-11-14)

SteveBronder and others added 6 commits March 31, 2020 23:44

make static tests for rev

c3795da

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…

07991a5

…stable/2017-11-14)

cpplint

c0ff21d

Add header includes

92e5686

reduce_sum headers

acf754e

header stuff

df6ab63

wds15 requested changes Apr 1, 2020

View reviewed changes

break reduce_sum mix tests into two seperate headers

6723026

wds15 reviewed Apr 2, 2020

View reviewed changes

bbbales2 and others added 6 commits April 7, 2020 07:38

Added std::forward calls to rev version of reduce_sum (Issue #1815)

7f76760

[Jenkins] auto-formatting by clang-format version 6.0.0

2658e80

Revert "Added std::forward calls to rev version of reduce_sum (Issue #…

7f893d5

…1815)" This reverts commit 7f76760.

Merge branch 'feature/reduce_sum' of https://github.com/stan-dev/math …

870be5a

…into feature/reduce_sum

Added std::forward calls to rev version of reduce_sum again (Issue #1815

ccee1f1

)

[Jenkins] auto-formatting by clang-format version 6.0.0

5bd7f91

SteveBronder and others added 4 commits April 7, 2020 18:56

Add STAN_NUM_THREADS to jenkinsfile, inline reduce_sum operators

ad02fdf

[Jenkins] auto-formatting by clang-format version 6.0.0

dcfffc5

remove export from windows threading test

2ea9a7b

Merge branch 'feature/reduce_sum' of github.com:stan-dev/math into fe…

8d8ebd1

…ature/reduce_sum

wds15 approved these changes Apr 8, 2020

View reviewed changes

wds15 merged commit f02c6df into develop Apr 8, 2020

SteveBronder mentioned this pull request Apr 16, 2020

Stan Math 3.2 release #1826

Closed

bbbales2 mentioned this pull request Apr 20, 2020

Stanc3 release for Cmdstan 2.23 stan-dev/stanc3#498

Closed

rok-cesnovar deleted the feature/reduce_sum branch May 3, 2020 08:44

adds reduce_sum and tests #1813

adds reduce_sum and tests #1813

Conversation

SteveBronder commented Mar 31, 2020 • edited by bbbales2 Loading

Summary

Tests

Side Effects

Release notes

Checklist

SteveBronder commented Mar 31, 2020

bbbales2 commented Mar 31, 2020

bob-carpenter commented Mar 31, 2020

SteveBronder commented Mar 31, 2020

wds15 commented Mar 31, 2020

bbbales2 commented Mar 31, 2020

SteveBronder commented Mar 31, 2020

SteveBronder commented Apr 1, 2020

wds15 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wds15 Apr 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SteveBronder commented Apr 1, 2020 • edited Loading

rok-cesnovar commented Apr 1, 2020

SteveBronder commented Apr 1, 2020

rok-cesnovar commented Apr 1, 2020

SteveBronder commented Apr 1, 2020

SteveBronder commented Apr 1, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wds15 commented Apr 2, 2020

wds15 commented Apr 7, 2020

bbbales2 commented Apr 7, 2020

wds15 commented Apr 7, 2020

bbbales2 commented Apr 7, 2020

stan-buildbot commented Apr 7, 2020

SteveBronder commented Apr 7, 2020

bbbales2 commented Apr 7, 2020

SteveBronder commented Apr 7, 2020

wds15 commented Apr 7, 2020

bbbales2 commented Apr 7, 2020

SteveBronder commented Apr 7, 2020 • edited Loading

bbbales2 commented Apr 7, 2020

stan-buildbot commented Apr 8, 2020

wds15 commented Apr 8, 2020

wds15 commented Apr 8, 2020

SteveBronder commented Mar 31, 2020 •

edited by bbbales2

Loading

wds15 Apr 2, 2020 •

edited

Loading

SteveBronder commented Apr 1, 2020 •

edited

Loading

SteveBronder commented Apr 7, 2020 •

edited

Loading