
[WIP] Parallel Prototype #1616

Closed
wants to merge 135 commits

Conversation

SteveBronder
Collaborator

This is just a PR for discussion of the TBB parallelism prototype.

From Ben's last comment on the Discourse thread below:

https://discourse.mc-stan.org/t/parallel-autodiff-v3/12474/40

Alright, stuff is up (proto-parallel-v3...cleanup/proto-parallel-v3).

I screwed up autodiff on the sliced arguments. I was scared about what would happen if all the vars in the sliced arguments had the same vari, so I just didn’t implement that.

I broke things into two functions, parallel_sum_var and parallel_sum. I think we need enable_if logic or something similar to distinguish between them properly.

I got rid of the init argument.

I got rid of local_operands_and_partials.

Arguments are stored as a tuple of const references:

std::tuple<const Args&...> args_tuple_;

Adjoints are accumulated in an eigen vector:

Eigen::VectorXd args_adjoints_;
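To make the storage scheme concrete, here is a minimal stand-alone sketch; toy_reducer is a hypothetical name, and std::vector&lt;double&gt; stands in for Eigen::VectorXd so the example carries no Eigen dependency. The shared arguments live in a tuple of const references, so constructing the reducer copies nothing, while adjoints accumulate into one flat buffer:

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical illustration of the reducer's storage: a tuple of const
// references to the shared arguments plus one flat adjoint buffer
// (std::vector<double> standing in for Eigen::VectorXd).
template <typename... Args>
struct toy_reducer {
  std::tuple<const Args&...> args_tuple_;
  std::vector<double> args_adjoints_;

  explicit toy_reducer(const Args&... args)
      : args_tuple_(args...), args_adjoints_(sizeof...(Args), 0.0) {}
};
```

Because the tuple holds references, mutations to the caller's arguments remain visible through args_tuple_, which is exactly why a per-thread deep copy is needed before running nested autodiff.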

The member function deep_copy of the recursive_reducer handles the deep copies:

auto args_tuple_local_copy = apply([&](auto&&... args) {
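That deep-copy pattern can be sketched with only the standard library; std::apply here plays the role of the hand-rolled apply, and plain by-value copies stand in for creating fresh vars on the thread-local autodiff stack:

```cpp
#include <tuple>
#include <utility>

// Unpack the tuple of const references and rebuild a tuple of by-value
// copies, so each thread's nested autodiff pass owns its operands.
// (Sketch only: the real code makes fresh vars, not plain copies.)
template <typename... Args>
auto deep_copy_tuple(const std::tuple<const Args&...>& args_tuple) {
  return std::apply(
      [](const auto&... args) { return std::make_tuple(args...); },
      args_tuple);
}
```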

The member function accumulate_adjoints collects the adjoints from the nested autodiff and arranges them in the Eigen::VectorXd:
accumulate_adjoints(args_adjoints_.data(), args...);
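The flattening idea can be sketched without any Stan types. In this hypothetical toy, a plain double stands in for a var's adjoint and a std::vector&lt;double&gt; for a container of them; each overload writes its slots into the flat buffer and returns the advanced destination pointer:

```cpp
#include <vector>

// Forward declarations so each overload can recurse into the others.
inline double* accumulate_adjoints(double* dest) { return dest; }
template <typename... Rest>
double* accumulate_adjoints(double* dest, double adj, const Rest&... rest);
template <typename... Rest>
double* accumulate_adjoints(double* dest, const std::vector<double>& adjs,
                            const Rest&... rest);

// Scalar: one slot.
template <typename... Rest>
double* accumulate_adjoints(double* dest, double adj, const Rest&... rest) {
  *dest += adj;
  return accumulate_adjoints(dest + 1, rest...);
}

// Container: one slot per element.
template <typename... Rest>
double* accumulate_adjoints(double* dest, const std::vector<double>& adjs,
                            const Rest&... rest) {
  for (double a : adjs) {
    *dest++ += a;
  }
  return accumulate_adjoints(dest, rest...);
}
```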

The reduce function accumulates the sum and the adjoint:

void join(const recursive_reducer& rhs) {

It seems like operator() of a single recursive_reducer might get called multiple times though? I was surprised by this.

The member function count_memory of parallel_sum_rev_impl counts the number of vars in all the input arguments so it can allocate memory to store the vari pointers:

const std::size_t num_terms = count_var(args...);
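A toy version of that counting pass, with the same stand-ins as above (a double plays the role of a var, an int of an arithmetic argument that carries no varis, and a std::vector&lt;double&gt; of a container), shows the recursion over the argument pack:

```cpp
#include <cstddef>
#include <vector>

// Forward declarations so overloads can recurse into one another.
inline std::size_t count_var() { return 0; }
template <typename... Rest>
std::size_t count_var(double, const Rest&... rest);
template <typename... Rest>
std::size_t count_var(int, const Rest&... rest);
template <typename... Rest>
std::size_t count_var(const std::vector<double>& v, const Rest&... rest);

// A double stands in for a var: one vari.
template <typename... Rest>
std::size_t count_var(double, const Rest&... rest) {
  return 1 + count_var(rest...);
}

// Arithmetic arguments carry no varis.
template <typename... Rest>
std::size_t count_var(int, const Rest&... rest) {
  return count_var(rest...);
}

// Containers contribute one vari per element.
template <typename... Rest>
std::size_t count_var(const std::vector<double>& v, const Rest&... rest) {
  return v.size() + count_var(rest...);
}
```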

The member function save_varis of parallel_sum_rev_impl copies the vari pointers from the arguments into an array for the precomputed vari function:

save_varis(varis, args...);

This should allow for Stan arguments real, int, real[], int[], vector, row_vector, matrix to be passed as the shared arguments. I think we can generalize to get arrays of arrays of whatever without too much trouble, but given I’m not actually testing much of any of that, I didn’t want to push the features yet.

It would require changes to deep_copy/count/accumulate_adjoints/save_varis.

We need to go through this and think through the references/forwarding stuff. I've forgotten how && works and whether we need it. I wasn't strictly const correct when I first coded this, and it led to memory problems because the deep copy wasn't actually copying, so we have to be careful with that stuff.

I had an implementation of apply in an old pull that got closed so I added it as a function here: https://github.com/stan-dev/math/blob/cbcf42fad798015b317ba6dab7a6ddc5d9983aa2/stan/math/prim/scal/functor/apply.hpp . It comes with tests too. I think I looked around at a bunch of stuff when I did this one and tried to get the forwarding right or whatnot. I should probably have a citation somewhere in the source code :/.
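For reference, the usual shape of such an apply is an index_sequence expansion. This is a from-scratch sketch in the spirit of the linked header, not its exact code, and it is named my_apply here purely to avoid colliding with C++17's std::apply under argument-dependent lookup:

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Expand the tuple's elements into the callable via an index pack.
template <class F, class Tuple, std::size_t... Is>
constexpr decltype(auto) my_apply_impl(F&& f, Tuple&& t,
                                       std::index_sequence<Is...>) {
  return std::forward<F>(f)(std::get<Is>(std::forward<Tuple>(t))...);
}

// Public entry point: build the index sequence from the tuple's size.
template <class F, class Tuple>
constexpr decltype(auto) my_apply(F&& f, Tuple&& t) {
  return my_apply_impl(
      std::forward<F>(f), std::forward<Tuple>(t),
      std::make_index_sequence<
          std::tuple_size<std::remove_reference_t<Tuple>>::value>{});
}
```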

@bbbales2
Member

@wds15 @SteveBronder I got initial docs for this up at stan-dev/docs#161, and a very small case study here: https://github.com/bbbales2/cmdstan_map_rect_tutorial/blob/reduce_sum/reduce_sum_tutorial.Rmd (copied from a Richard McElreath example @mitzimorris recommended: https://github.com/rmcelreath/cmdstan_map_rect_tutorial/blob/master/README.md). Example turned out to have a 7x speedup on 8 cores which is convenient lol, but maybe a bit too ambitious :/.

Feel free to edit these docs and push as you see fit. It's all first-pass stuff.

@rok-cesnovar
Member

Nice! As for the Windows threading question in there: Yes, threading works just fine on Windows.

@bbbales2
Member

@SteveBronder adj_jac_apply is in. More pulls more pulls :D!

@wds15
Contributor

wds15 commented Mar 25, 2020

@bbbales2 cool doc for users that you started. I forked it, made some additions here and there, and filed a PR against your repository. Have a look if you like it.

@wds15
Contributor

wds15 commented Mar 25, 2020

@SteveBronder you put together a nice todo list to get this in... but I can't find it any more. Still, I recall my name showed up for some doc... what exactly do you think I should write up?

@SteveBronder
Collaborator Author

When I went through the doc stuff it was mostly just things I knew or could dictate from the code, so I wasn't sure if a lot of it was nuanced enough. If you can read through it and add any additional info you think is good, that would be cool.

@bbbales2
Member

bbbales2 commented Mar 25, 2020

@wds15 The list I got from Mitzi was (I'm heavily editing it to reflect things I did):

1. Design doc comments -- questions and promissory notes should be nailed down/cleaned up in order to put this into the 2.23 release (or any release).

2. Stan User's Guide

3. Stan Language Reference Manual

4. Stan Functions Reference

5. Case study - I suggest redoing this using `reduce_sum`:
https://github.com/rmcelreath/cmdstan_map_rect_tutorial

6. Unit tests and end-to-end testing

I think the docs wait on the design-docs. (edit: feel free to go edit them, just warning that a large bit of the docs are copied from the design-docs, so those changes will cascade downwards)

Inline docs, feel free to work on. In terms of code, we need a branch with the accumulate_adjoints/save_varis/deep_copy_vars/count_vars functions, and also we need tests and checks added to reduce_sum to validate the grainsize argument.
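The grainsize check mentioned here is simple to sketch. validate_grainsize is a hypothetical stand-in; the real code would go through Stan's error-checking machinery rather than throwing directly like this:

```cpp
#include <stdexcept>
#include <string>

// Reject non-positive grainsizes before any work is handed to the TBB
// partitioner (sketch; Stan's own checks produce richer error messages).
inline void validate_grainsize(int grainsize) {
  if (grainsize <= 0) {
    throw std::domain_error("reduce_sum: grainsize must be positive, found "
                            + std::to_string(grainsize));
  }
}
```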

If you could double check that we have all the necessary tests in place for reduce_sum that'd be cool too. I kinda just threw those together and didn't think much more about them.

@SteveBronder
Collaborator Author

I'm personally just talking about inline docs.

we need a branch with the accumulate_adjoints/save_varis/deep_copy_vars/count_vars functions, and also we need tests and checks added to reduce_sum to validate the grainsize argument.

These are in #1800

@bbbales2
Member

When I merged in develop it had me delete these tests from the Jenkins file:

stage('Linux Unit with MPI') {
  agent { label 'linux && mpi' } 
  steps {
    deleteDir()
    unstash 'MathSetup'
    sh "echo CXX=${MPICXX} >> make/local"
    sh "echo CXX_TYPE=gcc >> make/local"
    sh "echo STAN_MPI=true >> make/local"
    runTests("test/unit")
  }
  post { always { retry(3) { deleteDir() } } }
} 

I'm just posting this here so I remember to investigate why those disappeared later. Felt weird deleting them manually here. Usually these git merges happen automatically and I wouldn't worry about it.

@rok-cesnovar
Member

rok-cesnovar commented Mar 26, 2020

Probably due to PR #1789

This PR (parallel prototype) introduces some Jenkinsfile formatting for some reason. Hence the conflict.

@bbbales2
Member

@rok-cesnovar thanks, yeah that looks like it. I just didn't want to accidentally mess anything up!

bbbales2 and others added 2 commits March 26, 2020 10:48
@bbbales2
Member

@wds15 @SteveBronder alright I think I got the right checks and tests in for reduce_sum. It should be complete.

@bbbales2
Member

(Of course it's probably not complete; I'm just saying it's fair to go through it with a fine-toothed comb, and if anything is missing it's a problem.)

@bbbales2
Member

Oops, forgot the deterministic version of the function. Can get that later today.

@SteveBronder
Collaborator Author

SteveBronder commented Mar 27, 2020

@wds15
Contributor

wds15 commented Mar 27, 2020

I think I have an idea how to test the deterministic thing. The promise is that the splits are always of the same size and at the same locations. We can create a mock reducer object which raises an exception if our expectations are not met. Does that make sense?
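A minimal, TBB-free sketch of that mock (mock_reducer and its members are hypothetical names): it records the expected [begin, end) splits and throws when handed a range outside that set, so any change in partitioning fails the test loudly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Mock reducer body: accepts only the exact splits the test expects,
// throwing on anything else.
struct mock_reducer {
  std::vector<std::pair<std::size_t, std::size_t>> expected_;

  void operator()(std::size_t begin, std::size_t end) const {
    for (const auto& e : expected_) {
      if (e.first == begin && e.second == end) {
        return;
      }
    }
    throw std::logic_error("unexpected split");
  }
};
```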

@bbbales2
Member

That is a good point. That is what we want to guarantee. I'll try this.

@bbbales2
Member

I added reduce_sum_static. The simple_partitioner was the simplest option: it just keeps splitting the work until each individual piece is less than or equal to the grainsize. That's the limit of the control we have there.
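That splitting rule can be illustrated without TBB. This mirrors the described behavior of halving a range while it is larger than the grainsize; split_range is a hypothetical helper written for this sketch, not TBB code:

```cpp
#include <cstddef>
#include <vector>

// Recursively halve [begin, end) until each piece is <= grainsize,
// recording the resulting piece sizes in order.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<std::size_t>& sizes) {
  std::size_t n = end - begin;
  if (n <= grainsize) {
    sizes.push_back(n);
    return;
  }
  std::size_t mid = begin + n / 2;
  split_range(begin, mid, grainsize, sizes);
  split_range(mid, end, grainsize, sizes);
}
```

Because the pieces depend only on the range length and the grainsize, never on thread timing, the partitioning is reproducible from run to run, which is the property reduce_sum_static is meant to expose.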

I called it reduce_sum_static but since we're not using the tbb static_partitioner (that's a different thing) we should probably rename it.

I wasn't keen on "deterministic" because it makes the execution sound deterministic, when it's only the breaking into pieces that should be deterministic. But maybe reduce_sum_deterministic is still better. Not sure.

@SteveBronder SteveBronder mentioned this pull request Mar 31, 2020
@SteveBronder
Collaborator Author

Closing this since the last PR is open
