
[WIP] Parallel Prototype #1616

Closed
wants to merge 135 commits

Conversation

SteveBronder
Collaborator

This is just a PR for discussion of the TBB parallelism prototype.

From Ben's last comment on the Discourse thread below:

https://discourse.mc-stan.org/t/parallel-autodiff-v3/12474/40

Alright, stuff is up (proto-parallel-v3...cleanup/proto-parallel-v3).

I screwed up autodiff on the sliced arguments. I was scared about what would happen if all the vars in the sliced arguments had the same vari, so I just didn’t implement that.

I broke things into two functions, parallel_sum_var and parallel_sum. I think we need enable_if logic or something similar to distinguish between them properly.

I got rid of the init argument.

I got rid of local_operands_and_partials.

Arguments are stored as a tuple of const references:

std::tuple<const Args&...> args_tuple_;

Adjoints are accumulated in an eigen vector:

Eigen::VectorXd args_adjoints_;
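To make the storage scheme concrete, here is a minimal stand-alone sketch; toy_reducer is a hypothetical name, and std::vector&lt;double&gt; stands in for Eigen::VectorXd so the example carries no Eigen dependency. The shared arguments live in a tuple of const references, so constructing the reducer copies nothing, while adjoints accumulate into one flat buffer:

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical illustration of the reducer's storage: a tuple of const
// references to the shared arguments plus one flat adjoint buffer
// (std::vector<double> standing in for Eigen::VectorXd).
template <typename... Args>
struct toy_reducer {
  std::tuple<const Args&...> args_tuple_;
  std::vector<double> args_adjoints_;

  explicit toy_reducer(const Args&... args)
      : args_tuple_(args...), args_adjoints_(sizeof...(Args), 0.0) {}
};
```

Because the tuple holds references, mutations to the caller's arguments remain visible through args_tuple_, which is exactly why a per-thread deep copy is needed before running nested autodiff.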

The member function deep_copy of the recursive_reducer handles the deep copies:

auto args_tuple_local_copy = apply([&](auto&&... args) {
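That deep-copy pattern can be sketched with only the standard library; std::apply here plays the role of the hand-rolled apply, and plain by-value copies stand in for creating fresh vars on the thread-local autodiff stack:

```cpp
#include <tuple>
#include <utility>

// Unpack the tuple of const references and rebuild a tuple of by-value
// copies, so each thread's nested autodiff pass owns its operands.
// (Sketch only: the real code makes fresh vars, not plain copies.)
template <typename... Args>
auto deep_copy_tuple(const std::tuple<const Args&...>& args_tuple) {
  return std::apply(
      [](const auto&... args) { return std::make_tuple(args...); },
      args_tuple);
}
```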

The member function accumulate_adjoints collects the adjoints from the nested autodiff and arranges them in the Eigen::VectorXd:
accumulate_adjoints(args_adjoints_.data(), args...);
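The flattening idea can be sketched without any Stan types. In this hypothetical toy, a plain double stands in for a var's adjoint and a std::vector&lt;double&gt; for a container of them; each overload writes its slots into the flat buffer and returns the advanced destination pointer:

```cpp
#include <vector>

// Forward declarations so each overload can recurse into the others.
inline double* accumulate_adjoints(double* dest) { return dest; }
template <typename... Rest>
double* accumulate_adjoints(double* dest, double adj, const Rest&... rest);
template <typename... Rest>
double* accumulate_adjoints(double* dest, const std::vector<double>& adjs,
                            const Rest&... rest);

// Scalar: one slot.
template <typename... Rest>
double* accumulate_adjoints(double* dest, double adj, const Rest&... rest) {
  *dest += adj;
  return accumulate_adjoints(dest + 1, rest...);
}

// Container: one slot per element.
template <typename... Rest>
double* accumulate_adjoints(double* dest, const std::vector<double>& adjs,
                            const Rest&... rest) {
  for (double a : adjs) {
    *dest++ += a;
  }
  return accumulate_adjoints(dest, rest...);
}
```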

The reduce function accumulates the sum and the adjoint:

void join(const recursive_reducer& rhs) {

It seems like operator() of a single recursive_reducer might get called multiple times though? I was surprised by this.

The member function count_memory of parallel_sum_rev_impl counts the number of vars in all the input arguments so it can allocate memory to store the vari pointers:

const std::size_t num_terms = count_var(args...);
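A toy version of that counting pass, with the same stand-ins as above (a double plays the role of a var, an int of an arithmetic argument that carries no varis, and a std::vector&lt;double&gt; of a container), shows the recursion over the argument pack:

```cpp
#include <cstddef>
#include <vector>

// Forward declarations so overloads can recurse into one another.
inline std::size_t count_var() { return 0; }
template <typename... Rest>
std::size_t count_var(double, const Rest&... rest);
template <typename... Rest>
std::size_t count_var(int, const Rest&... rest);
template <typename... Rest>
std::size_t count_var(const std::vector<double>& v, const Rest&... rest);

// A double stands in for a var: one vari.
template <typename... Rest>
std::size_t count_var(double, const Rest&... rest) {
  return 1 + count_var(rest...);
}

// Arithmetic arguments carry no varis.
template <typename... Rest>
std::size_t count_var(int, const Rest&... rest) {
  return count_var(rest...);
}

// Containers contribute one vari per element.
template <typename... Rest>
std::size_t count_var(const std::vector<double>& v, const Rest&... rest) {
  return v.size() + count_var(rest...);
}
```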

The member function save_varis of parallel_sum_rev_impl copies the vari pointers from the arguments into an array for the precomputed vari function:

save_varis(varis, args...);

This should allow for Stan arguments real, int, real[], int[], vector, row_vector, matrix to be passed as the shared arguments. I think we can generalize to get arrays of arrays of whatever without too much trouble, but given I’m not actually testing much of any of that, I didn’t want to push the features yet.

It would require changes to deep_copy/count/accumulate_adjoints/save_varis.

We need to go through this and think through the references/forwarding stuff. I've forgotten how && works and whether we need it. I wasn't strictly const correct when I first coded this, and it led to memory problems because the deep copy wasn't actually copying, so we have to be careful with that stuff.

I had an implementation of apply in an old pull that got closed so I added it as a function here: https://github.com/stan-dev/math/blob/cbcf42fad798015b317ba6dab7a6ddc5d9983aa2/stan/math/prim/scal/functor/apply.hpp . It comes with tests too. I think I looked around at a bunch of stuff when I did this one and tried to get the forwarding right or whatnot. I should probably have a citation somewhere in the source code :/.
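For reference, the usual shape of such an apply is an index_sequence expansion. This is a from-scratch sketch in the spirit of the linked header, not its exact code, and it is named my_apply here purely to avoid colliding with C++17's std::apply under argument-dependent lookup:

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Expand the tuple's elements into the callable via an index pack.
template <class F, class Tuple, std::size_t... Is>
constexpr decltype(auto) my_apply_impl(F&& f, Tuple&& t,
                                       std::index_sequence<Is...>) {
  return std::forward<F>(f)(std::get<Is>(std::forward<Tuple>(t))...);
}

// Public entry point: build the index sequence from the tuple's size.
template <class F, class Tuple>
constexpr decltype(auto) my_apply(F&& f, Tuple&& t) {
  return my_apply_impl(
      std::forward<F>(f), std::forward<Tuple>(t),
      std::make_index_sequence<
          std::tuple_size<std::remove_reference_t<Tuple>>::value>{});
}
```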

@bbbales2
Member

@wds15 @SteveBronder I got initial docs for this up at stan-dev/docs#161, and a very small case study here: https://github.com/bbbales2/cmdstan_map_rect_tutorial/blob/reduce_sum/reduce_sum_tutorial.Rmd (copied from a Richard McElreath example @mitzimorris recommended: https://github.com/rmcelreath/cmdstan_map_rect_tutorial/blob/master/README.md). Example turned out to have a 7x speedup on 8 cores which is convenient lol, but maybe a bit too ambitious :/.

Feel free to edit these docs and push as you see fit. It's all first-pass stuff.

@rok-cesnovar
Member

Nice! As for the Windows threading question in there: Yes, threading works just fine on Windows.

@bbbales2
Member

@SteveBronder adj_jac_apply is in. More pulls more pulls :D!

@wds15
Contributor

wds15 commented Mar 25, 2020

@bbbales2 cool doc for users that you started. I forked it, made some additions here and there, and filed a PR against your repository. Have a look if you like it.

@wds15
Contributor

wds15 commented Mar 25, 2020

@SteveBronder you put together a nice todo list to get this in... but I can't find it any more. Still, I recall my name showed up for some doc... what exactly do you think I should write up?

@SteveBronder
Collaborator Author

When I went through the doc stuff it was mostly just things I knew or could dictate from the code, so I wasn't sure if a lot of it was nuanced enough. If you can read through it and add any additional info you think is good, that would be cool.

@bbbales2
Member

bbbales2 commented Mar 25, 2020

@wds15 The list I got from Mitzi was (I'm heavily editing it to reflect things I did):

1. Design doc comments -- questions and promissory notes should be nailed down/cleaned up in order to put this into the 2.23 release (or any release).

2. Stan User's Guide

3. Stan Language Reference Manual

4. Stan Functions Reference

5. Case study - I suggest redoing this using `reduce_sum`:
https://github.com/rmcelreath/cmdstan_map_rect_tutorial

6. Unit tests and end-to-end testing

I think the docs wait on the design-docs. (edit: feel free to go edit them, just warning that a large bit of the docs are copied from the design-docs, so those changes will cascade downwards)

Inline docs, feel free to work on. In terms of code, we need a branch with the accumulate_adjoints/save_varis/deep_copy_vars/count_vars functions, and also we need tests and checks added to reduce_sum to validate the grainsize argument.
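The grainsize check mentioned here is simple to sketch. validate_grainsize is a hypothetical stand-in; the real code would go through Stan's error-checking machinery rather than throwing directly like this:

```cpp
#include <stdexcept>
#include <string>

// Reject non-positive grainsizes before any work is handed to the TBB
// partitioner (sketch; Stan's own checks produce richer error messages).
inline void validate_grainsize(int grainsize) {
  if (grainsize <= 0) {
    throw std::domain_error("reduce_sum: grainsize must be positive, found "
                            + std::to_string(grainsize));
  }
}
```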

If you could double check that we have all the necessary tests in place for reduce_sum that'd be cool too. I kinda just threw those together and didn't think much more about them.

@SteveBronder
Collaborator Author

I'm personally just talking about inline docs.

we need a branch with the accumulate_adjoints/save_varis/deep_copy_vars/count_vars functions, and also we need tests and checks added to reduce_sum to validate the grainsize argument.

These are in #1800

@bbbales2
Member

When I merged in develop it had me delete these tests from the Jenkins file:

stage('Linux Unit with MPI') {
  agent { label 'linux && mpi' } 
  steps {
    deleteDir()
    unstash 'MathSetup'
    sh "echo CXX=${MPICXX} >> make/local"
    sh "echo CXX_TYPE=gcc >> make/local"
    sh "echo STAN_MPI=true >> make/local"
    runTests("test/unit")
  }
  post { always { retry(3) { deleteDir() } } }
} 

I'm just posting this here so I remember to investigate why those disappeared later. Felt weird deleting them manually here. Usually these git merges happen automatically and I wouldn't worry about it.

@rok-cesnovar
Member

rok-cesnovar commented Mar 26, 2020

Probably due to PR #1789

This PR (parallel prototype) introduces some Jenkinsfile formatting for some reason. Hence the conflict.

@bbbales2
Member

@rok-cesnovar thanks, yeah that looks like it. I just didn't want to accidentally mess anything up!

bbbales2 and others added 2 commits March 26, 2020 10:48
@bbbales2
Member

@wds15 @SteveBronder alright I think I got the right checks and tests in for reduce_sum. It should be complete.

@bbbales2
Member

(Of course it's probably not complete; I'm just saying it's fair to go through it with a fine-toothed comb, and if anything is missing it's a problem.)

@bbbales2
Member

Oops, forgot the deterministic version of the function. Can get that later today.

@SteveBronder
Collaborator Author

SteveBronder commented Mar 27, 2020

@wds15
Contributor

wds15 commented Mar 27, 2020

I think I have an idea how to test the deterministic thing. The promise is that the splits are always of the same size and at the same locations. We can create a mock reducer object which raises an exception if our expectations are not met. Does that make sense?
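A minimal, TBB-free sketch of that mock (mock_reducer and its members are hypothetical names): it records the expected [begin, end) splits and throws when handed a range outside that set, so any change in partitioning fails the test loudly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Mock reducer body: accepts only the exact splits the test expects,
// throwing on anything else.
struct mock_reducer {
  std::vector<std::pair<std::size_t, std::size_t>> expected_;

  void operator()(std::size_t begin, std::size_t end) const {
    for (const auto& e : expected_) {
      if (e.first == begin && e.second == end) {
        return;
      }
    }
    throw std::logic_error("unexpected split");
  }
};
```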

@bbbales2
Member

That is a good point. That is what we want to guarantee. I'll try this.

@bbbales2
Member

I added reduce_sum_static. The simple_partitioner was the simplest option: it just keeps splitting the work until each individual piece is less than or equal to the grainsize. That's the limit of the control we have there.
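That splitting rule can be illustrated without TBB. This mirrors the described behavior of halving a range while it is larger than the grainsize; split_range is a hypothetical helper written for this sketch, not TBB code:

```cpp
#include <cstddef>
#include <vector>

// Recursively halve [begin, end) until each piece is <= grainsize,
// recording the resulting piece sizes in order.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<std::size_t>& sizes) {
  std::size_t n = end - begin;
  if (n <= grainsize) {
    sizes.push_back(n);
    return;
  }
  std::size_t mid = begin + n / 2;
  split_range(begin, mid, grainsize, sizes);
  split_range(mid, end, grainsize, sizes);
}
```

Because the pieces depend only on the range length and the grainsize, never on thread timing, the partitioning is reproducible from run to run, which is the property reduce_sum_static is meant to expose.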

I called it reduce_sum_static but since we're not using the tbb static_partitioner (that's a different thing) we should probably rename it.

I wasn't keen on "deterministic" because it makes the execution sound deterministic, when it's only the breaking into pieces that should be deterministic. But maybe reduce_sum_deterministic is still better. Not sure.

@SteveBronder SteveBronder mentioned this pull request Mar 31, 2020
@SteveBronder
Collaborator Author

Closing this since the last PR is open
