
WIP: map_rect_concurrent with parallel STL backends #1085

Closed
wants to merge 7 commits

Conversation

@wds15 (Contributor) commented Dec 16, 2018

I am adding WIP status as we probably need to settle on a bigger strategy for how we move forward with thread-based parallelism. Ideally we choose a single approach, such as the Intel TBB.

Summary

The intent is to allow C++17 parallel STL backends to be used with map_rect_concurrent, which offer speed improvements over the existing C++11-based version. The C++17 backend is only used optionally, if the user does one of the following:

  • the user defines STAN_PSTL_CPP17 during compilation => the standard C++17 backend is used
  • the user defines STAN_PSTL_INTEL during compilation (and in addition supplies compile arguments to build and link against Intel PSTL) => the Intel Parallel STL is used
  • if none of the above is done, then a C++11 implementation is used instead

What has been done in this PR is to implement a parallel version of std::for_each using only C++11 language features. This implementation is used as a fallback if neither the C++17 implementation nor the Intel Parallel STL is available. The respective code has been placed in stan/math/parallel/.
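For illustration, a C++11-only parallel for_each can be built by chunking the input range over std::async tasks. The following is a minimal sketch under that assumption, not the exact code in stan/math/parallel/:

#include <algorithm>
#include <future>
#include <iterator>
#include <vector>

// Apply f to every element of [first, last), splitting the range into
// num_threads chunks that run as independent async tasks.
template <typename InputIt, typename UnaryFunction>
void for_each_parallel(InputIt first, InputIt last, UnaryFunction f,
                       std::size_t num_threads) {
  const std::size_t num_jobs = std::distance(first, last);
  if (num_jobs == 0) return;
  // never start more threads than there are jobs
  num_threads = std::min(num_threads, num_jobs);
  const std::size_t chunk = num_jobs / num_threads;
  const std::size_t remainder = num_jobs % num_threads;
  std::vector<std::future<void>> futures;
  InputIt chunk_begin = first;
  for (std::size_t i = 0; i < num_threads; ++i) {
    // spread the remainder over the first chunks so that any combination
    // of threads and jobs lines up
    const std::size_t this_chunk = chunk + (i < remainder ? 1 : 0);
    InputIt chunk_end = chunk_begin;
    std::advance(chunk_end, this_chunk);
    futures.push_back(std::async(std::launch::async, [=]() {
      std::for_each(chunk_begin, chunk_end, f);
    }));
    chunk_begin = chunk_end;
  }
  for (auto& fut : futures)
    fut.get();  // rethrows any exception from the worker threads
}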

In order to test this PR with the Intel Parallel STL, please do (on a Mac):

  • download the binary edition of the Intel Parallel STL, from here
  • unpack it where you like
  • put into your make/local these lines:
# you need a g++ compiler which supports OpenMP SIMD; this one is g++ v7 from MacPorts
CXX=g++-mp-7
CC=g++-mp-7

# adjust the TBBROOT & PSTLROOT paths here (both are packaged together by Intel)
TBBROOT = $(HOME)/work/tbb2019_20181203oss
PSTLROOT = $(HOME)/work/pstl2019_20181203oss
CXXFLAGS += -std=c++17 -I$(TBBROOT)/include -I$(PSTLROOT)/include -fopenmp-simd
LDFLAGS += -L$(TBBROOT)/lib -Wl,-rpath,$(TBBROOT)/lib -ltbb -ltbbmalloc

Things should work on Linux as well... even Windows may be in scope, as I saw somewhere that condor provides packages for it (at least this came up in one of my Google searches, but I haven't found the link again). However, the Intel Threading Building Blocks, on which the Parallel STL is based, does support Windows (but the binary edition for Windows only includes builds for MSVC).

Tests

This PR also fixes the recently discovered bug in map_rect_concurrent that shows up whenever the number of threads and the number of jobs don't line up. The test for this is found in

test/unit/math/prim/mat/functor/map_rect_concurrent_threads_test.cpp

Since this test passes on this branch, I have prepared a second branch (feature/map_rect-fail-develop) which leaves develop unchanged but includes this test. On that branch the exact same test fails (while it passes on this branch).

Other tests:

  • test/unit/math/parallel/for_each_test.cpp to test the new for_each facility
  • test/unit/math/parallel/get_num_threads_test.cpp to test reading off the number of threads

Side Effects

The C++17 standard does not specify at all how the number of threads is controlled. For the Intel Parallel STL we will need to add code to cmdstan which initializes the backend of the Intel Parallel STL (the Intel Threading Building Blocks) accordingly.
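For the TBB of that era, this initialization amounts to constructing a scheduler with a thread cap early in main. A hypothetical sketch of such cmdstan-side code (tbb::task_scheduler_init was the TBB 2019 API for this; it has since been superseded by tbb::global_control):

#include <tbb/task_scheduler_init.h>
#include <cstdlib>

int main() {
  // read the requested thread count from the environment, defaulting to 1
  const char* env = std::getenv("STAN_NUM_THREADS");
  const int num_threads = env ? std::atoi(env) : 1;
  // cap the number of worker threads for all TBB-backed parallel algorithms
  tbb::task_scheduler_init init(num_threads);
  // ... run the usual cmdstan command
  return 0;
}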

The signature of get_num_threads was changed so that it no longer takes the number of jobs to compute as an argument. The intent of the function is to read the number of threads from the environment variable STAN_NUM_THREADS and nothing more. This is done because we will have to add code to cmdstan which controls the number of threads for the Intel Parallel STL; in cmdstan it does not make sense to pass in the number of jobs, we just want to read out STAN_NUM_THREADS.
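A minimal sketch of such a reader, assuming the convention that STAN_NUM_THREADS=-1 means "use all hardware threads" (the exact error handling in the PR may differ):

#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Read the requested thread count from STAN_NUM_THREADS; default to 1.
int get_num_threads() {
  const char* env = std::getenv("STAN_NUM_THREADS");
  if (env == nullptr)
    return 1;
  const int requested = std::stoi(env);
  if (requested > 0)
    return requested;
  if (requested == -1)  // -1 requests all available hardware threads
    return static_cast<int>(std::thread::hardware_concurrency());
  throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1");
}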

Additional Notes

It appears that we can get the Intel TBB (the backbone of the Parallel STL) to work on Windows using MinGW! I just followed these instructions and now have a tbb.dll on my Windows machine. In short, the steps are: install MinGW, add the MinGW path to PATH, and use PowerShell to execute mingw32-make.exe compiler=gcc. Thus it looks as if we will be able to get threading running on Windows with the MinGW compilers.

Checklist

@wds15 (Contributor, Author) commented Dec 16, 2018

In case it helps to have a quick hangout to explain this PR, I can jump on a call this week.

Tagging a few people who I know have seen this code: @seantalts, @sakrejda, @bob-carpenter, @syclik... maybe more, not sure.

@bob-carpenter (Contributor) commented:

I haven't actually looked at the lower-level code, but I am curious how all the user-facing config is going to play out.

I think this is going to require close scrutiny from @syclik and also @seantalts for implications for the math lib and language.

First, as a matter of policy, this is bringing C++17 code into our codebase. Presumably that's ifdef-ed out in non-C++17 environments? How is that determined and maintained?

Will we require C++17 and Intel STL for continuous integration testing? If not, how do we test these advertised options work?

I'm very confused about how all these parallelization schemes interact, but I haven't been following any of them very closely. For instance, is STAN_PSTL_INTEL compatible with MPI and GPU code? And presumably there's some interaction between STAN_PSTL_CPP17 and STAN_PSTL_INTEL; for example, what happens if both are specified?

@wds15 (Contributor, Author) commented Dec 16, 2018

The rules are simple, I hope:

  • if the user does not define anything => C++11 is used just as always
  • if the user declares STAN_PSTL_CPP17 => we use C++17 parallel stuff
  • if the user declares STAN_PSTL_INTEL => we use C++17 parallel stuff from Intel

In case the user specifies both PSTL macros, STAN_PSTL_CPP17 takes precedence.
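A hypothetical sketch of how such a compile-time dispatch can look (the macro spellings match this PR; the std_par alias and exact headers are illustrative assumptions):

#if defined(STAN_PSTL_CPP17)
// standard C++17 parallel algorithms; wins if both macros are set
#include <algorithm>
#include <execution>
namespace std_par = std;
#elif defined(STAN_PSTL_INTEL)
// Intel Parallel STL; execution policies live in pstl::execution
#include <pstl/algorithm>
#include <pstl/execution>
namespace std_par = pstl;
#else
// C++11 fallback implemented in stan/math/parallel/
#include <stan/math/parallel/for_each.hpp>
#endif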

This code is orthogonal to MPI and GPU. They should work fine together.

I am not sure there is a need to test the Intel PSTL or the C++17 variant of this... that would be the same as asking whether some construct from the standard library works. I mean, this should just work; all I am doing is using std::for_each with execution policies.
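For reference, the C++17 usage in question is no more than this (illustrative; with STAN_PSTL_INTEL the policy would come from pstl::execution instead):

#include <algorithm>
#include <execution>
#include <vector>

void run_jobs(std::vector<int>& jobs) {
  // std::execution::par lets the library run the loop across threads
  std::for_each(std::execution::par, jobs.begin(), jobs.end(),
                [](int& job) { /* evaluate one map_rect job */ });
}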

@bob-carpenter (Contributor) commented Dec 16, 2018 via email

@bob-carpenter (Contributor) commented Dec 16, 2018 via email

@syclik (Member) left a review requesting changes:


I'm just marking this as "request changes" so it doesn't get merged immediately. #1084 needs to be updated with information about how this affects GPU, MPI, and multi-threading. (And then we'll have to decide whether we want those interactions.)

@wds15 (Contributor, Author) commented Dec 17, 2018

Ok... I updated the issue for this with a lengthy description of those interactions.

The key thing to note is that I do not change anything here in terms of how we deal with threading. It is just a better (optional) backend which will give us massive speed gains. To see this, just follow the issue, which first describes where we are now and then where we go with this change.

@bob-carpenter: I am glad you like the design, as it sounds. Regarding your concern about clashes between the Parallel STL and standard C++17: from going over the sources of the Parallel STL, I think they do take care of this situation, so it appears to me that they are cautious about this problem. Now, I wanted to test this by including <execution> from the standard library just before the respective PSTL includes... but I can't, as I do not have a compiler available which supports <execution>. The current open-source compilers just don't provide this. I could (maybe) try with Intel's compiler, which seems to support the C++17 parallel features.

Now, there is one more parallel STL implementation out there, which is HPX. It got my attention in this post. However, when I installed it and tried to use it, the Stan program just segfaulted, so I gave up; but in theory this should be a very nice alternative for non-Intel systems. We just need to find someone who can compile and run it (it would take me a moment).

@wds15 (Contributor, Author) commented Dec 17, 2018

@seantalts ... could you look into the Linux instance, which does not seem to respond? Thanks much!

(I already restarted the tests once which I think is all I can do from my end)

@bob-carpenter (Contributor) commented Dec 17, 2018 via email

@wds15 (Contributor, Author) commented Dec 17, 2018

@bob-carpenter you write "It uses multi-threading, right? Doesn't that mean there'll be thread contention if you use multi-threaded map_rect and this."

I am not sure I understand what you mean. The "old" map_rect_concurrent based on async does not exist anymore; the old function has been refactored to use for_each with execution policies.

OpenMP would fire off independent threads. The Intel Parallel STL has been written such that it seamlessly integrates with existing applications; if a compiler supports both OpenMP and the parallel C++17 STL, the two will work together as well.

The Intel Parallel STL has a number of macros which check what has been defined, what compiler is used and all that.

Current compilers I am aware of which support the parallel C++17 features:

  • MSVC
  • Intel compilers

Parallel STL libraries:

  • Intel PSTL
  • HPX
  • there are a few more

Intel has offered its Parallel STL library to the GCC and libc++ projects as the basis for their C++17 implementations. As far as I am informed, the GCC project will do so, and we will probably see this in g++ 9.0. I don't know what libc++ (clang) will do.

Hooking up HPX is really a matter of compiling it (in a way that it works), including its headers, and then defining a namespace alias std_par for hpx::parallel; nothing more needs to be done from the Stan side (OK, limiting the number of threads is still needed somewhere in cmdstan).
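In code, that alias would amount to something like the following (the HPX header path is an assumption on my part):

#include <hpx/parallel/algorithms/for_each.hpp>

// all Stan code then calls std_par::for_each, regardless of backend
namespace std_par = hpx::parallel;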

The nice thing about using the C++17 standard is that everyone is adhering to it (in theory).

@seantalts (Member) commented:

Would it be crazy to just print an error if more than one of GPU, MPI, or THREADING is turned on in the makefiles? The only use case I can think of is a threading one where you might want to offload only a small-ish part of the likelihood to the GPU, and I think we should probably design for that case specifically (in a future / separate PR) instead of assuming that it should just work fine... It would really help me and others understand the boundaries of what should be going on, I think.
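Such a guard could be a few lines in the makefiles. A hypothetical sketch (the flag names STAN_OPENCL, STAN_MPI, and STAN_THREADS are illustrative, not necessarily the actual make variables):

# count how many of the three parallelism flags are set
PARALLEL_MODES := $(words $(STAN_OPENCL) $(STAN_MPI) $(STAN_THREADS))
ifneq ($(filter-out 0 1,$(PARALLEL_MODES)),)
  $(error Only one of STAN_OPENCL, STAN_MPI, and STAN_THREADS may be enabled)
endif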

@wds15 (Contributor, Author) commented Dec 17, 2018

I gave a detailed description of my understanding of what goes together, how, and under what conditions in the issue for this PR... maybe we should copy that into a wiki page? This question comes up very often.

@seantalts (Member) commented:

I did read that :) Just to clarify, I'm talking about a general solution to the perennial "how do these compile flags / modes interact" question as well, and not something the code in this PR should address.

Just to highlight, I think our current understanding is a little thorny in the following areas:

in case map_rect is called and each job accesses the GPU, then this needs locking of the GPU resource... I do not know how the GPU code is organized here; maybe @rok-cesnovar can comment. However, this is nothing new.

it is still possible to make a parallel for_each call... this is only safe if the AD stack is not touched and the GPU resource is thread-safe.

what does not work is to nest MPI calls inside of threads. So if we have multiple threads and each thread wants to make MPI calls, this will fail with the current setup. To my knowledge it is possible to make this work, but this constellation cannot arise in any Stan program as of now. The only situation where it can arise is if an interface decides to sample multiple chains using threads and then allows MPI as a backend. That won't work at the moment.

Mostly, things either do not work or we do not know how they might work. I am noting this and stating that it's actually fine if none of these interactions work for now; in fact, declaring this explicitly would help everyone involved and greatly ease what we're trying to communicate to users. So I propose we say for now that no interactions between these modes and their make flags are currently supported, and that we introduce a little error message somewhere if someone tries to use more than one at once. We can also, if desired, add a GitHub issue for any use cases any of us can think up where an interaction mode is useful (ideally with a model example).

@wds15 (Contributor, Author) commented Dec 17, 2018

Sounds good to me... each of these features is a killer in its own right. Starting with the policy that we don't intend the interactions to work is fine for now... in particular if we give errors about this.

@bob-carpenter (Contributor) commented Dec 17, 2018 via email

@wds15 (Contributor, Author) commented Dec 18, 2018

I see what you mean... the Intel Parallel STL and C++17 are optional, just to make sure. Now, testing on our systems with the Intel Parallel STL should be straightforward to set up. We could download the binary editions of the Parallel STL for Mac, which work out of the box for me. The Linux binaries hopefully also work out of the box. That would be yet another thing to test; something to decide during the review.

@wds15 changed the title from "map_rect_concurrent with parallel STL backends" to "WIP: map_rect_concurrent with parallel STL backends" on Jan 7, 2019
@wds15 (Contributor, Author) commented Jan 27, 2019

Taking this down as we are planning to go with the Intel TBB directly.

@wds15 closed this on Jan 27, 2019