
Refactor mpi build system #910

Merged: 15 commits merged into develop from build/boost-compiler on Jul 3, 2018

Conversation

@seantalts (Member) commented Jun 19, 2018

EDIT: The use of a specific toolset for boost is now easily modifiable by the user through the user-config.jam file within the boost lib folder.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@seantalts (Member Author) commented:

@wds15 @syclik
This has a problem in that the compiler for MPI will be set to something like mpicxx and the actual C++ compiler passed in as a flag or environment variable... Fixing that will be non-trivial, and it's actually something we need to do if we want to support openmpi or mpich with a compiler not set as a flag on the command line (because you use an environment variable to tell openmpi which compiler to use, and that messes up our detect_cc script).

Some options here:

  • I'm leaning towards having separate CC and MPI_CC settings - this way we can still run detect_cc on CC and pass CC to boost (see the sketch after this list)
  • We could tell people who want to use MPI that they have to use the system default compilers because we can't configure them (and they have to make sure that boost sees the same default we do)
  • We could write code in detect_cc that looks at CC and, if it sees "mpicxx," checks all of the ways mpicxx can be configured (environment variables and command line options differ between openmpi, mpich, etc.) and attempts to pick the actual underlying C++ compiler out from those
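
A minimal make/local sketch of the first option (MPI_CC is a proposed name, not an existing variable):

# make/local (hypothetical sketch)
CC = clang++      # drives detect_cc and the compiler handed to boost
MPI_CC = mpicxx   # used only to build the MPI-enabled pieces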

@wds15 (Contributor) commented Jun 20, 2018

I am really not sure if we need to support an mpich / openMPI choice that explicitly. From my perspective there should be a single system MPI installation. In case both openMPI and mpich are installed, it is up to the user to load just one of them into the system paths. This can be done with update-alternatives or the modules facility or whatever.

Having more than one MPI on a system is a mess if not managed adequately - and I don't think we need to manage that.

This is why I am still surprised to see the need for this PR. To me boost just does its magic and builds in whatever way is the default best way (which depends on what's in the system's paths).

The whole point of the mpicxx command is to take away the burden of fiddling with MPI compiler and linker flags... and I haven't seen the need yet to use the same compiler for boost mpi as for the rest of Stan, but maybe that is wrong. In any case my preferred setup would be that the compiler used for MPI is also used for Stan, but that can be difficult to achieve since that compiler is usually the system compiler, which can be old.

@seantalts (Member Author) commented Jun 20, 2018 via email

@wds15 (Contributor) commented Jun 20, 2018

Sure.

Still... is there a need to build boost mpi with a specific compiler aligned with Stan?

@syclik (Member) commented Jun 20, 2018 via email

@wds15 (Contributor) commented Jun 20, 2018

Damn. Computers are a minefield.

I am fine with getting this to work, but I would not mind leaving this extra configuration bit out if we run into walls here. Users with such involved systems can provide a boost mpi install that works for both Stan and the MPI installation... this would be my suggestion (unless it turns out not to be too big a headache for us).

Will look into this.

@syclik (Member) commented Jun 20, 2018 via email

@wds15 (Contributor) commented Jun 21, 2018

I have never seen ABI trouble and I suspect that this happens rarely and is likely a hint toward the need to upgrade the system anyways.

Everything was already working on all platforms (Travis Mac, Travis Linux, Jenkins Mac+Linux), so I think we are in good shape.

BTW, I tend to disagree with the statement that Travis Linux is a typical setup. Their Linux is an Ubuntu 14.04 image - more than 4 years old! To me that's outdated.

I will hopefully have time to look into this tomorrow (Friday).

The current failure on Travis is due to misconfiguration of clang for the boost system.

BTW, will it be possible to specify on Mac that you want the "darwin" toolset? Darwin is the toolset for Apple platforms in the boost build system. Probably we should introduce a make variable BOOST_CC_TARGET which we set accordingly.
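
A sketch of how such a variable could be threaded through to the boost build (BOOST_CC_TARGET is the proposed name; the defaulting and the b2 invocation are illustrative, assuming an OS variable holding the uname result):

# hypothetical makefile sketch
ifeq ($(OS),Darwin)
  BOOST_CC_TARGET ?= darwin
else
  BOOST_CC_TARGET ?= gcc
endif
# ... later handed to the boost build, roughly:
#   ./b2 toolset=$(BOOST_CC_TARGET) --with-mpi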

While we are at it... maybe we should also add testing of MPI on Travis under Mac? If so, I can add that to this PR, which will make sure that things work there as well. Should I?

@syclik (Member) commented Jun 21, 2018 via email

@wds15 (Contributor) commented Jun 21, 2018

What a mess!

Here is what I would suggest:

  • We do not ov

@wds15 wds15 closed this Jun 21, 2018
@wds15 (Contributor) commented Jun 21, 2018

AHHHHH.... WRONG BUTTON.

@wds15 wds15 reopened this Jun 21, 2018
@wds15 (Contributor) commented Jun 21, 2018

I see the trouble... I think we should simplify our life a bit here and not go crazy with super smart auto-detection.

The problem with the proposed solution right now is that, for example, on Apple one should actually use darwin instead of clang or gcc as the toolset. For Intel compilers the toolset is intel. In short, I don't think we should attempt to fix the boost build system for the rare case that this is really a problem (I have never seen it).

Thus, I am suggesting we make our setup a bit friendlier in terms of specifying the boost toolset in case that is needed. I did this on a branch from this one, see here (in particular the changes in make/libraries).

For the CC_TYPE problem: we should just ask the user to set it in make/local manually. That is fair enough, I would say. We should add this to the wiki page; that would be my suggestion.

I hope this makes sense to everyone. Essentially we give users more control in case their system is unusual. If everyone is fine with this, I am happy to merge the proposed changes into this branch.
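
For illustration, a minimal user-config.jam override of the kind this enables (the toolset and compiler path are examples, not what the branch generates):

# user-config.jam (hypothetical example)
using darwin ;                  # force the darwin toolset on Apple
# or pin an explicit compiler:
# using gcc : : /usr/bin/g++-7 ;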

@wds15 (Contributor) commented Jun 21, 2018

I forgot... there is an alternative to using mpicxx: in essence it is just a wrapper around a call to the compiler which makes sure that the right compiler and linker flags are set. For openMPI you can display this with

> mpicxx --showme
g++ -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc

So telling Stan to use CC=g++ along with the above compiler and linker flags will give the same result as calling mpicxx. It's nothing more than a thin wrapper.

@syclik (Member) commented Jun 21, 2018 via email

@bob-carpenter (Contributor) commented Jun 21, 2018 via email

@wds15 (Contributor) commented Jun 22, 2018

@syclik Yes, we can use the options directly, but the MPI docs recommend against doing so. So these two entries in make/local should have the same effect:

with mpicxx:

CXX=mpicxx
CC_TYPE=gcc

directly, by using the output from --showme (for mpich the equivalent option is -show, I think):

CXX=g++
CXXFLAGS=-I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread
LDFLAGS=-L/usr/lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc

This will give the same result after all... but as I say, the MPI docs clearly recommend using the mpicxx wrappers (and they also recommend using the same compiler that was used to build MPI). In case of drastic things like an ABI change... how can you ensure that the MPI layer still works ok? Bear in mind that MPI is darn low-level, and if you watch the build process of openMPI or mpich you'll see they figure out the bit representation of all the low-level types. Thus, if the compiler is massively updated and messes with these base types, then you are screwed. This is why I would only admit moderate deviations from the MPI compiler. In any other case you are running things at risk. Now, this does not mean that we have to test all of this - we simply require that it works (and if someone wants to find out, they can probably run the boost mpi test suite).

@seantalts (Member Author) commented Jun 22, 2018

Wait, it's supposed to be with the compiler that OpenMPI or mpich was compiled with? How can we be sure to match that? What does boost do in its autodetect?

I think we have three tasks here:

  1. Make sure we use a single toolchain for all the stuff we compile
  2. Try to make that be the same toolchain the MPI library was compiled with (???)
  3. Allow for that toolchain to be configured by the user

I'd be willing to drop 3 for the MPI use case, but it seems like we still need to figure out the first two...

@wds15 (Contributor) commented Jun 24, 2018

First, by now I think we are attempting the impossible here: assume there is an ABI mismatch between the compiler which is supposed to be used for Stan and the one used to build the system MPI installation. This should fail in any case - even if we get boost to build with the Stan compiler. I mean, how is it possible for boost to build and link against the system MPI when the ABI has changed? That does not sound to me as if it can work at all.

I always assumed that boost build will figure out the correct thing to do. This is what their documentation tells me; in rare cases you have to take action yourself. This means that boost will usually just use the default compiler it finds on the system. That compiler is almost certainly the one which was used to build MPI and is in consequence the right one. Very likely boost is even smarter and parses what it gets from mpicxx. EDIT: Thus, if we use mpicxx for Stan we should be good with respect to how boost build will be set up.

Please have a look at the branch boost/build-2. It makes a few changes on top of this one which allow the user to configure the boost toolset if they want to. Doing so is really a very advanced thing to do.

Other than that, all we need to do is ask users to use the mpicxx command as CC. Changing the compiler is at the user's risk. Thus, we should add to our documentation on MPI that changing the compiler to anything other than mpicxx (like mpicxx -cxx=whatever) is at the risk of the user, who has to ensure that the ABI has not changed. This is in line with the recommendations given by the MPI implementations. We should stick to that.

Anything else is outside the range of what I would be willing to support. Of course, the mpicxx compiler will often be an old one, which may not be ideal for building Stan. Now, our own tests show that there is some flexibility, as we used clang++-3.8 or g++-4.9 on Travis. Thus, there is some room for flexibility, but this is not a guarantee which we have to make or attempt to support. As a start we should recommend the use of mpicxx and mention the combinations which we know work.

@seantalts seantalts mentioned this pull request Jun 24, 2018
@bob-carpenter (Contributor) commented Jun 24, 2018 via email

@syclik (Member) commented Jun 25, 2018 via email

@wds15 (Contributor) commented Jun 28, 2018

In the line below it is written:

/usr/bin/ld: bin/math/prim/arr/functor/mpi_cluster_inst.o: relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC

... so would CXXFLAGS=-fPIC maybe help?

@wds15 (Contributor) commented Jun 28, 2018

Ok, I was just able to reproduce this problem in my VirtualBox, and it turns out that the multiple translation unit test was indeed messed up. The solution was to compile mpi_cluster_inst.cpp with the -fPIC compiler option. I am not sure what this actually does, but the test did run fine on my machine.
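
For reference, a GNU make target-specific variable is enough for this kind of per-object flag (a sketch; the exact rule lives in the PR diff):

# give only this object the -fPIC flag (hypothetical sketch)
bin/math/prim/arr/functor/mpi_cluster_inst.o : CXXFLAGS += -fPIC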

@seantalts (Member Author) commented:

It will generate position independent code - practically speaking this sometimes results in slower code, but this SO post (which I trust like 40%) suggests that all shared libraries should be compiled with -fPIC anyway. I think it's fine, thanks for fixing.

@wds15 (Contributor) commented Jun 29, 2018

BTW, we are now forced to compile mpi_cluster_inst.cpp with a flag which causes a slow-down, merely because of one of our tests. This is not a serious slow-down at all, but to me it shows that we are often bound to the most conservative setting, despite the fact that most Stan programs are never built as a shared library. Not so ideal.

@syclik (Member) commented Jun 29, 2018 via email

@wds15 (Contributor) commented Jun 29, 2018

I don't think that all of Stan gets slower... only the code for mpi_cluster_inst.cpp will be compiled with -fPIC.

I know why the test is there. In order to comply with it, I am forced to always compile mpi_cluster_inst.cpp with -fPIC.

It would be better to have a flag like "compile Stan for a shared lib" which then switches on -fPIC globally (see the sketch below). This way the price for shared libs would be paid only when that is really needed, and not all the time. As this right now only affects a very small portion of the MPI code, there is no real harm. I was just commenting because we often make these types of decisions (maybe that's good). For this case it is not really worth spending more time on it, I think.
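
A sketch of what such an opt-in flag could look like (STAN_SHARED_LIB is a hypothetical name, not an existing variable):

# hypothetical: set STAN_SHARED_LIB=true in make/local to build everything with -fPIC
ifdef STAN_SHARED_LIB
  CXXFLAGS += -fPIC
endif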

@syclik (Member) commented Jun 29, 2018 via email

@wds15 (Contributor) commented Jun 29, 2018

Yeah, this is probably worth an issue... I'll make a mental note for the time being.

@seantalts (Member Author) commented:

So is this ok to merge?

I'm also wondering why we didn't need -fPIC on it before...

@wds15 (Contributor) commented Jun 29, 2018

I think we can merge, yes. Maybe we only tested this on macOS before, by chance?

@wds15 (Contributor) commented Jul 2, 2018

See also stan-dev/stan#2558 for an explanation of the latest commit, which we need urgently.

I actually expect that the Jenkins upstream run will fail, because cmdstan does not yet have the changes from PR stan-dev/cmdstan#628 merged. Thus we need to merge these jointly in short succession once we all agree on them.

@wds15 (Contributor) commented Jul 2, 2018

@seantalts please have a look at this latest commit. Thanks.

@seantalts (Member Author) commented:

Why did you add that to this pull request? It looks good, though.

@seantalts (Member Author) commented:

So we have to merge this PR and stan-dev/cmdstan#628 at the same time?

@wds15 (Contributor) commented Jul 2, 2018

Yes, both need to be merged in short succession. The key change I added here is to replace

LIBMPI = $(BOOST_LIB)/libboost_serialization$(DLL) $(BOOST_LIB)/libboost_mpi$(DLL) bin/math/prim/arr/functor/mpi_cluster_inst.o

with

LIBMPI = $(BOOST_LIB)/libboost_serialization$(DLL) $(BOOST_LIB)/libboost_mpi$(DLL) $(MATH)bin/math/prim/arr/functor/mpi_cluster_inst.o

(and similar changes). Only then will the upstream packages refer to the correct mpi_cluster_inst.o.

And the change for cmdstan ensures that the main makefile of cmdstan includes the snippet from stan-math which gives make the rule for how to actually build mpi_cluster_inst.o.
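
A sketch of the cmdstan side (the real change is in stan-dev/cmdstan#628; the snippet path here is an assumption based on the make/libraries file mentioned above):

# cmdstan makefile (hypothetical sketch): pull in the stan-math rules,
# including the one that builds $(MATH)bin/math/prim/arr/functor/mpi_cluster_inst.o
include $(MATH)make/libraries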

@wds15 (Contributor) commented Jul 2, 2018

(and sorry for adding it to this one)... if you think it is cleaner, we can make a separate PR for this small change, containing only the last commit. I thought this was easier, but I have probably not followed our processes cleanly here.

@seantalts (Member Author) commented Jul 2, 2018

That's fine for now. I'm going to kick off a test run for this PR + the cmdstan branch at the same time...

[edit] here is the job http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/PR-910/12/pipeline

@wds15 (Contributor) commented Jul 2, 2018

Cool. I did not know that we could do this type of testing. Right now Jenkins seems to hang?

@seantalts (Member Author) commented Jul 2, 2018 via email

@wds15 (Contributor) commented Jul 2, 2018

No, I am not sure (this stuff is not too transparent to me). Let's wait a bit more.

@seantalts (Member Author) commented Jul 2, 2018 via email

@seantalts seantalts changed the title Address part of #905 - pass compiler setting through to boost Refactor mpi build system Jul 2, 2018
@seantalts seantalts merged commit 9164c25 into develop Jul 3, 2018
@seantalts seantalts deleted the build/boost-compiler branch July 3, 2018 04:33