
WIP: map_rect_concurrent with parallel STL backends #1085

Closed
wants to merge 7 commits

Conversation

@wds15 (Contributor) commented Dec 16, 2018

I am adding WIP status as we probably need to settle on a bigger strategy for how we move forward with thread-based parallelism. Ideally we choose a single approach, such as the Intel TBB.

Summary

The intent is to allow C++17 parallel STL backends to be used with map_rect_concurrent, which offer speed improvements over the existing C++11-based version. The C++17 backend is only used optionally, if the user does one of the following:

  • the user defines STAN_PSTL_CPP17 during compilation => the standard C++17 backend is used
  • the user defines STAN_PSTL_INTEL during compilation (and in addition supplies compile arguments to build and link against Intel PSTL) => the Intel Parallel STL is used
  • if none of the above is done, then a C++11 implementation is used instead

What has been done in this PR is to implement a parallel version of std::for_each using only C++11 language features. This implementation is used as a fallback if neither the C++17 implementation nor the Intel Parallel STL is available. The respective code has been placed in stan/math/parallel/.
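For illustration, a C++11-only parallel for_each can be built by chunking the input range over std::async tasks. The following is a minimal sketch under that assumption, not the exact code in stan/math/parallel/:

#include <algorithm>
#include <future>
#include <iterator>
#include <vector>

// Apply f to every element of [first, last), splitting the range into
// num_threads chunks that run as independent async tasks.
template <typename InputIt, typename UnaryFunction>
void for_each_parallel(InputIt first, InputIt last, UnaryFunction f,
                       std::size_t num_threads) {
  const std::size_t num_jobs = std::distance(first, last);
  if (num_jobs == 0) return;
  // never start more threads than there are jobs
  num_threads = std::min(num_threads, num_jobs);
  const std::size_t chunk = num_jobs / num_threads;
  const std::size_t remainder = num_jobs % num_threads;
  std::vector<std::future<void>> futures;
  InputIt chunk_begin = first;
  for (std::size_t i = 0; i < num_threads; ++i) {
    // spread the remainder over the first chunks so that any combination
    // of threads and jobs lines up
    const std::size_t this_chunk = chunk + (i < remainder ? 1 : 0);
    InputIt chunk_end = chunk_begin;
    std::advance(chunk_end, this_chunk);
    futures.push_back(std::async(std::launch::async, [=]() {
      std::for_each(chunk_begin, chunk_end, f);
    }));
    chunk_begin = chunk_end;
  }
  for (auto& fut : futures)
    fut.get();  // rethrows any exception from the worker threads
}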

In order to test this PR with the Intel Parallel STL, please do (on a Mac):

  • download the binary edition of the Intel Parallel STL, from here
  • unpack it where you like
  • put into your make/local these lines:
# you need a g++ compiler which supports OpenMP SIMD; this one is g++ v7 from MacPorts
CXX=g++-mp-7
CC=g++-mp-7

# adjust the TBBROOT & PSTLROOT paths here (both are packaged together by Intel)
TBBROOT = $(HOME)/work/tbb2019_20181203oss
PSTLROOT = $(HOME)/work/pstl2019_20181203oss
CXXFLAGS += -std=c++17 -I$(TBBROOT)/include -I$(PSTLROOT)/include -fopenmp-simd
LDFLAGS += -L$(TBBROOT)/lib -Wl,-rpath,$(TBBROOT)/lib -ltbb -ltbbmalloc

Things should work on Linux as well... even Windows may be in scope, as I saw somewhere that condor provides packages for it (at least this came up in one of my Google searches, but I haven't found the link again). However, the Intel Threading Building Blocks, on which the Parallel STL is based, does support Windows (but the binary edition for Windows only includes builds for MSVC).

Tests

This PR also fixes the recently discovered bug in map_rect_concurrent that shows up whenever the number of threads and the number of jobs don't line up. The test for this is found in

test/unit/math/prim/mat/functor/map_rect_concurrent_threads_test.cpp

Since this test passes on this branch, I have prepared a second branch (feature/map_rect-fail-develop) which leaves develop unchanged but includes this test. On that branch the exact same test fails (while it passes on this branch).

Other tests:

  • test/unit/math/parallel/for_each_test.cpp to test the new for_each facility
  • test/unit/math/parallel/get_num_threads_test.cpp to test reading off the number of threads

Side Effects

The C++17 standard does not specify at all how the number of threads is controlled. For the Intel Parallel STL we will need to add code to cmdstan which initializes the backend of the Intel Parallel STL (the Intel Threading Building Blocks) accordingly.
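For the TBB of that era, this initialization amounts to constructing a scheduler with a thread cap early in main. A hypothetical sketch of such cmdstan-side code (tbb::task_scheduler_init was the TBB 2019 API for this; it has since been superseded by tbb::global_control):

#include <tbb/task_scheduler_init.h>
#include <cstdlib>

int main() {
  // read the requested thread count from the environment, defaulting to 1
  const char* env = std::getenv("STAN_NUM_THREADS");
  const int num_threads = env ? std::atoi(env) : 1;
  // cap the number of worker threads for all TBB-backed parallel algorithms
  tbb::task_scheduler_init init(num_threads);
  // ... run the usual cmdstan command
  return 0;
}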

The signature of get_num_threads was changed so that it no longer takes the number of jobs to compute as an argument. The intent of the function is to read the number of threads from the environment variable STAN_NUM_THREADS and nothing more. This is done because we will have to add code to cmdstan which controls the number of threads for the Intel Parallel STL; in cmdstan it does not make sense to pass in the number of jobs, we just want to read out STAN_NUM_THREADS.
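A minimal sketch of such a reader, assuming the convention that STAN_NUM_THREADS=-1 means "use all hardware threads" (the exact error handling in the PR may differ):

#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Read the requested thread count from STAN_NUM_THREADS; default to 1.
int get_num_threads() {
  const char* env = std::getenv("STAN_NUM_THREADS");
  if (env == nullptr)
    return 1;
  const int requested = std::stoi(env);
  if (requested > 0)
    return requested;
  if (requested == -1)  // -1 requests all available hardware threads
    return static_cast<int>(std::thread::hardware_concurrency());
  throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1");
}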

Additional Notes

It appears that we can get the Intel TBB (the backbone of the Parallel STL) to work on Windows using MinGW! I just followed these instructions and now have a tbb.dll on my Windows machine. In short, the steps are: install MinGW, add the MinGW path to PATH, and use PowerShell to execute mingw32-make.exe compiler=gcc. Thus it looks as if we will be able to get threading running on Windows with the MinGW compilers.

Checklist

@wds15 (Contributor, Author) commented Dec 16, 2018

In case it helps to have a quick hangout to explain this PR, I can jump on a call this week.

Tagging a few people who I know have seen this code: @seantalts, @sakrejda, @bob-carpenter, @syclik... maybe more, not sure.

@bob-carpenter (Contributor) commented:

I haven't actually looked at the lower-level code, but I am curious how all the user-facing config is going to play out.

I think this is going to require close scrutiny from @syclik and also @seantalts for implications for the math lib and language.

First, as a matter of policy, this is bringing C++17 code into our codebase. Presumably that's ifdef-ed out in non-C++17 environments? How is that determined and maintained?

Will we require C++17 and Intel STL for continuous integration testing? If not, how do we test these advertised options work?

I'm very confused about how all these parallelization schemes interact, but I haven't been following any of them very closely. For instance, is STAN_PSTL_INTEL compatible with MPI and GPU code? And presumably there's some interaction between STAN_PSTL_CPP17 and STAN_PSTL_INTEL; for example, what happens if both are specified?

@wds15 (Contributor, Author) commented Dec 16, 2018

The rules are simple, I hope:

  • if the user does not define anything => C++11 is used just as always
  • if the user declares STAN_PSTL_CPP17 => we use C++17 parallel stuff
  • if the user declares STAN_PSTL_INTEL => we use C++17 parallel stuff from Intel

In case the user specifies both PSTL macros, STAN_PSTL_CPP17 takes precedence.
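A hypothetical sketch of how such a compile-time dispatch can look (the macro spellings match this PR; the std_par alias and exact headers are illustrative assumptions):

#if defined(STAN_PSTL_CPP17)
// standard C++17 parallel algorithms; wins if both macros are set
#include <algorithm>
#include <execution>
namespace std_par = std;
#elif defined(STAN_PSTL_INTEL)
// Intel Parallel STL; execution policies live in pstl::execution
#include <pstl/algorithm>
#include <pstl/execution>
namespace std_par = pstl;
#else
// C++11 fallback implemented in stan/math/parallel/
#include <stan/math/parallel/for_each.hpp>
#endif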

This code is orthogonal to MPI and GPU. They should work fine together.

I am not sure there is a need to test the Intel PSTL or the C++17 variant of this... that would be the same as asking whether some construct from the standard library works. I mean, this should just work; all I am doing is using std::for_each with execution policies.
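For reference, the C++17 usage in question is no more than this (illustrative; with STAN_PSTL_INTEL the policy would come from pstl::execution instead):

#include <algorithm>
#include <execution>
#include <vector>

void run_jobs(std::vector<int>& jobs) {
  // std::execution::par lets the library run the loop across threads
  std::for_each(std::execution::par, jobs.begin(), jobs.end(),
                [](int& job) { /* evaluate one map_rect job */ });
}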

@bob-carpenter (Contributor) commented Dec 16, 2018 via email

@bob-carpenter (Contributor) commented Dec 16, 2018 via email

@syclik (Member) left a review requesting changes:


I'm just marking this as "request changes" so it doesn't get merged immediately. #1084 needs to be updated with information about how this affects GPU, MPI, and multi-threading. (And then we'll have to decide whether we want those interactions.)

@wds15 (Contributor, Author) commented Dec 17, 2018

Ok... I updated the issue for this with a lengthy description of those interactions.

The key thing to note is that I do not change anything here in terms of how we deal with threading. It is just a better (optional) backend which will give us massive speed gains. To see this, just follow the issue, which first describes where we are now and then where we go with this change.

@bob-carpenter: I am glad you like the design, as it sounds. Regarding your concern about clashes between the Parallel STL and standard C++17: from going over the sources of the Parallel STL, I think they do take care of this situation, so it appears to me that they are cautious about this problem. Now, I wanted to test this by including <execution> from the standard library just before the respective PSTL includes... but I can't, as I do not have a compiler available which supports <execution>. The current open-source compilers just don't provide this. I could (maybe) try with Intel's compiler, which seems to support the C++17 parallel features.

Now, there is one more parallel STL implementation out there, which is HPX. It got my attention in this post. However, when I installed it and tried to use it, the Stan program just segfaulted, so I gave up; but in theory this should be a very nice alternative for non-Intel systems. We just need to find someone who can compile and run it (it would take me a moment).

@wds15 (Contributor, Author) commented Dec 17, 2018

@seantalts ... could you look into the Linux instance, which does not seem to respond? Thanks much!

(I already restarted the tests once which I think is all I can do from my end)

@bob-carpenter (Contributor) commented Dec 17, 2018 via email

@wds15 (Contributor, Author) commented Dec 17, 2018

@bob-carpenter you write "It uses multi-threading, right? Doesn't that mean there'll be thread contention if you use multi-threaded map_rect and this."

I am not sure I understand what you mean. The "old" map_rect_concurrent based on async does not exist anymore; the old function has been refactored to use for_each with execution policies.

OpenMP would fire off independent threads. The Intel Parallel STL has been written such that it seamlessly integrates with existing applications; if a compiler supports both OpenMP and the parallel C++17 STL, the two will work together as well.

The Intel Parallel STL has a number of macros which check what has been defined, what compiler is used and all that.

Current compilers I am aware of which support the parallel C++17 features:

  • MSVC
  • Intel compilers

Parallel STL libraries:

  • Intel PSTL
  • HPX
  • there are a few more

Intel has offered its Parallel STL library to the GCC and libc++ projects as the basis for their C++17 implementations. As far as I am informed, the GCC project will do so, and we will probably see this in g++ 9.0. I don't know what libc++ (clang) will do.

Hooking up HPX is really a matter of compiling it (in a way that it works), including its headers, and then defining a namespace alias std_par for hpx::parallel; nothing more needs to be done from the Stan side (OK, limiting the number of threads is still needed somewhere in cmdstan).
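In code, that alias would amount to something like the following (the HPX header path is an assumption on my part):

#include <hpx/parallel/algorithms/for_each.hpp>

// all Stan code then calls std_par::for_each, regardless of backend
namespace std_par = hpx::parallel;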

The nice thing about using the C++17 standard is that everyone is adhering to it (in theory).

@seantalts (Member) commented:

Would it be crazy to just print an error if more than one of GPU, MPI, or THREADING is turned on in the makefiles? The only use case I can think of is a threading one where you might want to offload only a small-ish part of the likelihood to the GPU, and I think we should probably design for that case specifically (in a future / separate PR) instead of assuming that it should just work fine... It would really help me and others understand the boundaries of what should be going on, I think.
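Such a guard could be a few lines in the makefiles. A hypothetical sketch (the flag names STAN_OPENCL, STAN_MPI, and STAN_THREADS are illustrative, not necessarily the actual make variables):

# count how many of the three parallelism flags are set
PARALLEL_MODES := $(words $(STAN_OPENCL) $(STAN_MPI) $(STAN_THREADS))
ifneq ($(filter-out 0 1,$(PARALLEL_MODES)),)
  $(error Only one of STAN_OPENCL, STAN_MPI, and STAN_THREADS may be enabled)
endif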

@wds15 (Contributor, Author) commented Dec 17, 2018

I gave a detailed description of my understanding of what goes together, how, and under what conditions in the issue for this PR... maybe we should copy that into a wiki page? This question comes up very often.

@seantalts (Member) commented:

I did read that :) Just to clarify, I'm talking about a general solution to the perennial "how do these compile flags / modes interact" question as well, and not something the code in this PR should address.

Just to highlight, I think our current understanding is a little thorny in the following areas:

in case map_rect is called and each job accesses the GPU, then this needs locking of the GPU resource... I do not know how the GPU code is organized here; maybe @rok-cesnovar can comment. However, this is nothing new.

it is still possible to make a parallel for_each call... this is only safe if the AD stack is not touched and the GPU resource is thread-safe.

what does not work is to nest MPI calls inside of threads. So if we have multiple threads and each thread wants to make MPI calls, this will fail with the current setup. To my knowledge it is possible to make this work, but this constellation cannot arise in any Stan program as of now. The only situation where it can arise is if an interface decides to sample multiple chains using threads and then allows MPI as a backend. That won't work at the moment.

Mostly, things either do not work or we do not know how they might work. I am noting this and stating that it's actually fine if none of these interactions work for now; in fact, declaring this explicitly would help everyone involved and greatly ease what we're trying to communicate to users. So I propose we say for now that no interactions between these modes and their make flags are currently supported, and that we introduce a little error message somewhere if someone tries to use more than one at once. We can also, if desired, add a GitHub issue for any use cases any of us can think up where an interaction mode is useful (ideally with a model example).

@wds15 (Contributor, Author) commented Dec 17, 2018

Sounds good to me... each of these features is a killer in its own right. Starting with the policy that we don't intend the interactions to work is fine for now... in particular if we give errors about this.

@bob-carpenter (Contributor) commented Dec 17, 2018 via email

@wds15 (Contributor, Author) commented Dec 18, 2018

I see what you mean... the Intel Parallel STL and C++17 are optional, just to make sure. Now, testing on our systems with the Intel Parallel STL should be straightforward to set up. We could download the binary editions of the Parallel STL for Mac, which work out of the box for me. The Linux binaries hopefully also work out of the box. That would be yet another thing to test; something to decide during the review.

@wds15 changed the title from "map_rect_concurrent with parallel STL backends" to "WIP: map_rect_concurrent with parallel STL backends" on Jan 7, 2019
@wds15 (Contributor, Author) commented Jan 27, 2019

Taking this down as we are planning to go with the Intel TBB directly.

@wds15 closed this on Jan 27, 2019