WIP: map_rect_concurrent with parallel STL backends #1085
Conversation
In case it helps to have a quick hangout to explain this PR, I can jump on a call this week. Tagging a few people who I know have seen this code .. @seantalts , @sakrejda , @bob-carpenter , @syclik ... maybe more, not sure. |
I haven't actually looked at the lower-level code, but I am curious how all the user-facing config is going to play out. I think this is going to require close scrutiny from @syclik and also @seantalts for implications for the math lib and language.

First, as a matter of policy, this is bringing C++17 code into our codebase. Presumably that's ifdef-ed out in non-C++17 environments? How is that determined and maintained? Will we require C++17 and the Intel STL for continuous integration testing? If not, how do we test that these advertised options work?

I'm very confused about how all these parallelization schemes interact, but haven't been following any of them very closely. For instance, is ... |
The rules are simple, I hope: in case the user specifies both of the PSTL macros, then the ... This code is orthogonal to MPI and GPU; it should just work fine together. I am not sure if there is a need to test the Intel PSTL or the C++17 variant of this... this would be the same as asking if some construct from the standard library works. I mean... this should just work; all I am doing is using the std::for_each with execution policies. |
I am not sure if there is a need to test the Intel PSTL or the C++17 variant of this... this would be the same as asking if some construct from the standard library works.
There's a bit of logic within the ifdefs for
https://github.com/stan-dev/math/pull/1085/files#diff-428e9c3dd0a77766abc6060111a18771
based on platform. So it's not just a matter of who loads
which standard library from the outside.
There are also includes of new libraries like <pstl/execution>
in the PSTL option. What happens if another file includes <execution>
in the same translation unit?
I'm not saying it should necessarily be tested, just that it's
not automatically guaranteed to work with everything else we have.
I mean... this should just work; all I am doing is using the std::for_each with execution policies.
Right. It's more the includes and getting the typedefs and invocations
right that I was concerned about.
That, and just basic specification combinatorics with a couple more
parallelization options.
|
I mean... this should just work; all I am doing is using the std::for_each with execution policies.
This is very cool how this works, by the way! It's clear what's
going on looking at the file with the typedefed name resolution.
|
I'm just marking this as request changes
so it doesn't get merged immediately. #1084 needs to be updated with information about how this affects GPU, MPI, and multi-threading. (And then we'll have to decide whether we want those interactions.)
Ok... I updated the issue for this with a lengthy description of those interactions. The key thing to take note of is that I do not change anything in terms of how we deal with threading here. It is just a better (optional) backend which will give us massive speed gains. To see this, just follow the issue, which first describes where we are now and where we go with this change.

@bob-carpenter: I am glad you like the design as it sounds. Wrt your concern of clashes between the parallel STL and the standard C++17: from going over the sources of the Parallel STL, I think they do take care of this situation. So they are cautious about this problem, as it appears to me. Now... I wanted to test this by including execution from the standard library just before the respective PSTL includes... but I can't, as I do not have a compiler available to me which supports execution.

Now, there is one more parallel STL implementation out there, which is HPX. It got my attention in this post. However, when I installed it and tried to use it, the Stan program just segfaulted, such that I gave up; but in theory this should be a very nice alternative for non-Intel systems. We just need to find someone who can compile it and run it (it would take me a moment). |
@seantalts ... could you have a look at the linux instance, which does not seem to respond? Thanks much! (I already restarted the tests once, which I think is all I can do from my end) |
On Dec 17, 2018, at 6:45 AM, wds15 ***@***.***> wrote:
Ok... I updated the issue for this with a lengthy description for those interactions.
The key thing to take note of that I do not change anything in terms of how we deal with threading here. It is just a better (optional) backend which will give use massive speed gains.
It uses multi-threading, right? Doesn't that mean there'll be thread contention
if you use multi-threaded map_rect and this? Or if we use the OpenMP parallelism
for distributions that is also being considered?
To see this, just follow the issue which first describes where we are now and where we go with this change.
@bob-carpenter : I am glad you like the design as it sounds. Wrt to your concern of clashes between the parallel STL and the standard C++17: From going over the sources from the Parallel STL I think they do take care of this situation. So they are cautious about this problem as it appears to me.
How do they avoid multiple definitions if they have their own version of the library? I'd have thought we'd have to ifdef out those libraries wherever they occur, which means maybe a top-level solution for includes that everyone can use that wants those libraries. That is, a Stan math version of those libs that we include rather than the standard lib headers so that they can branch. That would introduce more combinatorics into the code.
Now... I wanted to test this by including execution from the standard library just before the respective PSTL includes... but I can't as I do not have a compiler available to me which supports execution.
Is the Intel STL the only standard lib implementation that supports C++17 std::execution?
The current open-source compilers just don't provide this. I could (maybe) try with Intel's compiler, which seems to have support for C++17 parallel features.
Now, there is one more parallel STL implementation out there which is HPX. It got my attention in this post. However, when I installed it and tried to use it the Stan program just segfaulted such that I gave up, but in theory this should be a very nice alternative for non-Intel systems. We just need to find someone who can compile it and run it (it would take me a moment).
I think it makes sense to get one solution running before opening up a bunch more!
|
@bob-carpenter you write "It uses multi-threading, right? Doesn't that mean there'll be thread contention if you use multi-threaded map_rect and this." I am not sure I understand what you mean. The "old" OpenMP would fire off independent threads. The Intel Parallel STL has been written such that it seamlessly integrates with existing applications. In case a compiler supports OpenMP and the parallel C++17 STL, then these will work together as well. The Intel Parallel STL has a number of macros which check what has been defined, what compiler is used, and all that. Current compilers I am aware of which support the parallel C++17 stuff are ...
Intel has offered their Parallel STL library to the GCC and libc++ projects to be used as the basis for their C++17 implementations. As far as I am informed, the GCC project will do so, and we will probably see this in g++ 9.0. Don't know what libc++ (clang) will do.

Hooking up HPX is really a matter of compiling it (in a way that it works), including their includes, and then defining ... The nice thing about using the C++17 standard is that everyone is adhering to it (in theory). |
Would it be crazy to just print an error if more than 1 of GPU, MPI, or THREADING are turned on in the makefiles? The only use case I can think of is maybe a threading one where you want to offload only a small-ish part of the likelihood to the GPU, and I think we should probably try to design for that case specifically (in the future / separate PR) instead of assuming that it should just work fine... It would really help me and others understand the boundaries of what should be going on, I think. |
I gave a detailed description of what goes together how and under what conditions (to my understanding) in the issue for this PR... maybe we should copy that into a wiki page? This question comes up very often. |
I did read that :) Just to clarify, I'm talking about a general solution to the "How do these compile flags / modes interact" perennial question as well, and not something the code in this PR should address. Just to highlight, I think our current understanding is a little thorny in the following areas:
Mostly things either do not work or we do not know how they might work. I am noting this here and stating that it's actually fine if none of these interactions work for now; in fact, declaring this explicitly would help everyone involved and greatly ease what we're trying to communicate to users. So I propose we say for now that no interactions between these modes and their make flags are currently supported, and we introduce somewhere a little error message if someone tries to use more than one at once. We can also, if desired, add a GitHub issue for any use cases any of us are able to think up where an interaction mode is useful (ideally with a model example). |
Sounds good to me... each of these features is a killer on its own. Starting with the policy that we don't intend the interactions to work is fine for now... in particular if we give errors about this. |
I see---I missed that this was a replacement for map_rect_concurrent. I should've noticed it in the title! I thought this was another way to add parallelism.
I also agree that following standards is good. The only thing I worry about here is that it's a standard we're not yet supporting, and that we'd be saying we can't test code we're distributing and encouraging people to use.
|
I see what you mean... the Intel Parallel STL and C++17 are optional, just to make sure that's clear. Now, testing on our systems with the Intel Parallel STL should be straightforward to set up. We could download the binary editions of the Parallel STL for Mac, which work out of the box for me. The Linux binaries hopefully also work out of the box. That would be yet another thing to test - something to decide during the review. |
Taking this down as we are planning to go with the Intel TBB straight. |
I am adding WIP status as we probably need to settle on a bigger strategy as to how we move forward with thread based parallelism. Ideally we choose a single approach like the Intel TBB.
Summary
The intent is to allow C++17 parallel STL backends to be used with `map_rect_concurrent`, which offer speed improvements over the currently existing C++11 based version. The C++17 backend is only used optionally, if the user does one of:

- define `STAN_PSTL_CPP17` during compilation => the standard C++17 backend is used
- define `STAN_PSTL_INTEL` during compilation (and in addition supplies compile arguments to build and link against the Intel PSTL) => the Intel Parallel STL is used

What has been done in this PR is to implement a parallel version of `std::for_each` using C++11 language features only. This implementation is used as a fall-back if neither the C++17 version nor the Intel Parallel STL is available. The respective code has been placed in `stan/math/parallel/`.

In order to test this PR with the Intel Parallel STL, please add the needed lines to `make/local` (on a Mac).

Things should work on Linux as well... even Windows may be in scope with this, as I saw somewhere that condor provides packages for it (at least this came up in one of my google searches, but I haven't found the link again). However, the Intel Threading Building Blocks, on which the Parallel STL is based, does support Windows (but the binary edition for Windows only includes builds for msvc).
Tests

This PR also fixes the recently discovered bug in `map_rect_concurrent` whenever the number of threads and jobs don't line up. The test for this is found in `test/unit/math/prim/mat/functor/map_rect_concurrent_threads_test.cpp`. Since this test is OK on this branch, I have prepared a second branch which has `develop` unchanged but includes this test (`feature/map_rect-fail-develop`). On that branch the exact same test is failing (while it is fine on this branch).

Other tests:

- `test/unit/math/parallel/for_each_test.cpp` to test the new `for_each` facility
- `test/unit/math/parallel/get_num_threads_test.cpp` for tests reading off the number of threads

Side Effects
The C++17 standard does not specify at all how the number of threads is controlled. For the Intel Parallel STL we will need to add code to cmdstan which initializes the backend of the Intel Parallel STL (which is the Intel Threading Building Blocks) accordingly.

The signature of the function `get_num_threads` was changed such that it no longer takes as an argument the number of jobs which we want to compute. The intent of the function is to read the number of threads off the environment variable `STAN_NUM_THREADS` and nothing more. This is done because we will have to add code to cmdstan which controls the number of threads for the Intel Parallel STL. In cmdstan it does not make sense to pass in the number of jobs; rather, we just want to read out `STAN_NUM_THREADS`.
Additional Notes

It appears that we can get the Intel TBB (the backbone of the Parallel STL) to work on Windows using MinGW! I just followed these instructions and I have a `tbb.dll` on my Windows machine. In short, the steps are: install MinGW, add the MinGW path to `PATH`, and use the powershell to execute `mingw32-make.exe compiler=gcc`... thus it looks as if we will be able to get threading running on Windows with the MinGW compilers.

Checklist
- Math issue: Make optional use of parallel C++17 algorithms in map_rect_concurrent #1084
- Copyright holder: Sebastian Weber

  The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
  - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

- the basic tests are passing
  - unit tests (`./runTests.py test/unit`)
  - header checks (`make test-headers`)
  - docs (`make doxygen`)
  - lint (`make cpplint`)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested