Refactor mpi build system #910
Conversation
@wds15 @syclik some options here:
|
I am really not sure if we need to support an mpich / openMPI choice that explicitly. From my perspective there should be a single system MPI installation. In case both openMPI and mpich are installed, it is up to the user to load just one of them into the system paths. This can be done with update-alternatives, the modules facility, or whatever. Having more than one MPI on a system is a mess if not managed adequately, and I don't think we need to manage that. This is why I am still surprised to see the need for this PR. To me boost just does its magic and builds in whatever way is the default best way (which depends on what's in the system's paths). The whole point of the mpicxx command is to take away the burden of fiddling with MPI compiler and linker flags... and I haven't seen the need yet to have the same compiler for boost mpi as for the rest of Stan, but maybe that is wrong. In any case my preferred setup will be that the compiler used for MPI is also used for Stan, but that can be difficult to achieve since that compiler is usually the system compiler, which can be old. |
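To illustrate the "load just one of them into the system paths" idea, here is a minimal shell sketch. The install paths are assumptions for illustration, not taken from this thread:

```shell
# Minimal sketch of selecting one MPI stack: with two MPI installations
# present, whichever bin directory comes first on PATH decides which
# mpicxx wrapper a build picks up. The paths are assumptions.
PATH=/opt/openmpi/bin:/opt/mpich/bin:$PATH
first_entry=${PATH%%:*}   # the wrapper found first by the build lives here
echo "$first_entry"       # prints: /opt/openmpi/bin
```

On Debian-like systems this same selection is what update-alternatives manages; on clusters it is typically the modules facility, as the comment notes.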
This PR is about having different compilers, not different MPI libraries.
|
Sure. Still... is there a need to make boost mpi be built with a specific compiler aligned with Stan? |
Yes. It only works when they adhere to the same ABI, which is only guaranteed with the same library. Sometimes it'll work across versions of a compiler and sometimes across compilers, but that can't be taken as an assumption.
|
Damn. Computers are a minefield. I am fine with getting this to work, but I would not mind leaving this extra configuration bit out if we run against walls here. Users with such involved systems can provide a working boost mpi install which works for Stan and the MPI installation... this would be my suggestion (unless it turns out to be too big a headache for us). Will look into this. |
Yes, it's tricky. And often undocumented or even documented incorrectly.
It'd be great if we had a set of specific instructions that worked if
followed exactly. It doesn't matter so much if it is complicated as long as
it is clear and correct.
If that set of instructions doesn't work under normal user setups (which I
think Travis and our Jenkins boxes represent), then I think we need to be
clear about when it doesn't work (or be really clear about the conditions
under which it does work). If we know why, or have alternate instructions,
we should also put that down.
|
I have never seen ABI trouble and I suspect that this happens rarely and is likely a hint that the system needs an upgrade anyway. Everything was already working on all platforms (Travis mac, Travis Linux, Jenkins mac+Linux), so I think we are in good shape.

BTW, I tend to disagree with the statement that Travis Linux is a typical setup. Their Linux is an Ubuntu 14.04 image - more than 4 years old! To me that's outdated.

I will hopefully have time to look into this tomorrow (Friday). The current failure on Travis is due to misconfiguration of clang for the boost system.

BTW, will it be possible to specify on Mac that you want the "darwin" compiler system? Darwin is the target on Apple platforms for the boost system. Probably we should introduce a make variable BOOST_CC_TARGET which we set accordingly.

While we are at it... maybe we also add testing of MPI on Travis under Mac? If so I can add this to this PR, which will make sure that things work there as well. Should I? |
Yes, the ABI problem doesn’t really show up **when you compile everything
consistently**. I’ve seen it when I’ve compiled the CVODES library with
g++, then used an upgraded version of clang++. With what we're now doing
with libraries for MPI and GPU, we'll see this much more often. When
binaries are shipped from somewhere, including Boost, there's a real
chance that they're mismatched and won't work.
Re: Ubuntu 14.04. I agree that it’s old, but I don’t think that it’s that
uncommon to have a stable, but old system. In a recent short course (2016),
I had multiple people with Windows XP. We don’t need to make it work for
that platform. We need to let people know that we either haven’t tested it
or we know it doesn’t work. If we know it doesn’t work, it’d be great if we
could trap that error and give users a nice out, but that isn’t necessary.
What we don’t want to do is put out instructions that tell you how to
change your system and how to make it work, then just fail after that.
|
What a mess! Here is what I would suggest:
|
AHHHHH.... WRONG BUTTON. |
I see the trouble... I think we should simplify our life a bit here and not go crazy with super smart auto-detection. The problem with the proposed solution right now is that, for example, on Apple one should actually use the darwin toolset. Thus, I am suggesting to make our setup a bit friendlier in terms of specifying the boost toolset in case it is needed. I did this on a branch from this one, see there. I hope this makes sense to everyone. Essentially we give users more control in case their system is weird. If all are fine with this, I am happy to merge the proposed changes into this branch. |
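As the EDIT at the bottom of this thread notes, specifying the toolset ended up going through boost's user-config.jam. A hypothetical entry might look like this - the compiler path here is an assumption, not from the PR:

```jam
# Hypothetical user-config.jam entry: pin boost-build to an explicit
# toolset/compiler so it matches the compiler used for Stan.
# The path is an assumption; adjust it to the local installation.
using darwin : : /usr/bin/clang++ ;
```

The `using <toolset> : <version> : <command> ;` form is boost-build's standard way to register a compiler; leaving the version field empty lets boost-build pick it up from the command.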
I forgot... there is an alternative to using mpicxx: in essence it is just a wrapper around a call to the compiler which makes sure that the right compiler and linker flags are set. For openMPI you can get this displayed by mpicxx --show-me:

g++ -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -pthread -L/usr//lib -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl -lhwloc

So telling Stan to use CC=g++ and to use the above compiler and linker flags will give the same result as calling mpicxx. It's nothing more than a thin wrapper. |
I remember asking about this before.
Are you suggesting we use the options directly? (After running that command)
|
I think darwin is just what Xcode reports---it's clang++:
$ clang++ --version
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
|
@syclik Yes, we can use the options directly, but the MPI docs recommend against doing so. So the build can either be set up with mpicxx, or directly by using the output from mpicxx --show-me. This will give the same result after all... but as I say, the MPI docs clearly recommend using the wrapper. |
Wait, it's supposed to be built with the compiler that OpenMPI or mpich was compiled with? How can we be sure to match that? What does boost do in its autodetect? I think we have three tasks here:
I'd be willing to drop 3 for the MPI use case, but it seems like we still need to figure out the first two... |
First, I think by now that we are attempting the impossible here: assume there is an ABI mismatch between the compiler which is supposed to be used for Stan and the one used to build the system MPI installation. This should fail in any case - even if we get boost to build with the Stan compiler. I mean, how is it possible for boost to build and link against the system MPI when the ABI has changed? That does not sound to me as if it can work at all.

I always assumed that boost build will figure out the correct thing to do; this is what their documentation tells me. In rare cases you have to take action yourself. This means that boost will usually just use the default compiler it finds on the system. This compiler is almost certainly the one which was used to build MPI and in consequence is the right one. Very likely boost is even smarter and parses what it gets from mpicxx.

Please have a look at the branch I mentioned. Other than that, all we need to do is to ask users to use the system default compiler. Anything else is outside the range of what I would be willing to support. |
We decided at the beginning of this that we would start with a few configs that we knew how to support and grow out rather than trying to launch covering everything that Stan currently supports. I'd very much like to stick to that plan, so I think it should be up to Sebastian to delimit that support. As long as the requirements are documented, we don't have to support everything.
The one thing we do have to make sure is that we won't break a standard config that's not trying to use MPI.
|
Yes, completely agreed on both.
It shouldn't affect all the other users that don't want to use MPI. I think
we're good there.
For those that do want to use MPI, it should be clear when it does work. I
think this experimenting is all part of figuring that out; I'm pretty sure
we assumed it'd work in lots of conditions where it doesn't. I think
documenting what we know would be a good path forward.
|
In the line below it is written
... so would a |
Ok, I was just able to reproduce this problem on my VirtualBox, and it turns out that the multiple translation unit test was indeed messed up. The solution was to compile mpi_cluster_inst.cpp with -fPIC. |
It will generate position independent code - practically speaking this sometimes results in slower code, but this SO post (which I trust like 40%) suggests that all shared libraries should be compiled with -fPIC. |
BTW, we are now forced to compile mpi_cluster_inst.cpp with a flag which causes a slow-down merely due to one of our tests. This one is not a serious slow-down at all, but to me this shows that we are often bound to the most conservative setting despite the fact that most Stan programs are never built as a shared library. Not so ideal. |
Sorry, mind explaining that again? I didn’t quite understand it. Does all
of Stan get slower or just one test?
That test is there because that’s how it’s being used by multiple clients
of the math library. This includes researchers who are linking multiple
translation units as well as RStan. So, it is a real use case and an active
one.
|
I don't think that all of Stan gets slower... only the code for mpi_cluster_inst.cpp will be compiled with -fPIC. I know why the test is there. In order to comply with it, I am forced to compile mpi_cluster_inst.cpp *always* with -fPIC. It would be better to have a flag like "compile Stan for a shared lib" which then switches on -fPIC globally. This way the price for shared libs would be paid only when really needed, and not all the time. As this right now only affects a very small portion of the MPI code, there is no real harm. I was just commenting because these are the kind of decisions we often make (maybe that's good). For this case it is not really worth spending more time on it, I think. |
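The "compile Stan for a shared lib" flag suggested above could be sketched as a make switch. The variable name STAN_SHARED is invented here for illustration:

```make
# Hypothetical sketch (STAN_SHARED is an invented name): pay the -fPIC
# cost only when Stan is actually built into a shared library, instead
# of compiling mpi_cluster_inst.cpp with -fPIC unconditionally.
STAN_SHARED ?= false
ifeq ($(STAN_SHARED),true)
  CXXFLAGS += -fPIC
endif
```

A user building for RStan or another shared-library client would invoke `make STAN_SHARED=true`; everyone else keeps the faster non-PIC default.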
It would have helped to create a new discussion or a new issue laying that
out rather than making a complaint in an existing pull request. It’s hard
to know what to do with a comment like that except mention that there’s a
reason for it.
If there’s a reasonable solution, which you’ve laid out, we should put that
in an issue so someone can fix it. If not, it will get lost once this pull
request is done.
|
Yeah, this is probably worth an issue... I'll make a mental note for the time being. |
So is this ok to merge? I'm also wondering why we didn't need -fPIC on it before... |
I think we can merge, yes. Maybe we tested before on macOS by chance? |
See also stan-dev/stan#2558 for an explanation of the latest commit, which we need urgently. I do actually expect the Jenkins upstream run to fail, because cmdstan does not yet have the changes from PR stan-dev/cmdstan#628 merged. Thus we need to merge these jointly in short succession once we all agree. |
@seantalts please have a look at this latest commit. Thanks. |
Why did you add that to this pull request? It looks good, though. |
So we have to merge this PR and stan-dev/cmdstan#628 at the same time? |
Yes, both need to go in short succession. The key thing I added here is to swap the relevant make definitions (and similar changes); only then will the upstream packages refer to the correct ones.

And the change for cmdstan ensures that the main makefile of cmdstan includes the snippet from stan-math which tells make how to actually build the required targets. |
(and sorry for adding it to this one)... if you think it is cleaner, we can make a separate PR for this small change which includes only the last commit. I thought it was easier, but I have probably not followed our processes cleanly here. |
That's fine for now. I'm going to kick off a test run for this PR + the cmdstan branch at the same time... [edit] here is the job http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/Math%20Pipeline/detail/PR-910/12/pipeline |
Cool, I did not know that we could do this type of testing. Right now Jenkins seems to hang? |
Hang? Are you sure it's not just queued waiting for a node to run on?
|
No, I am not sure (this stuff is not too transparent to me). Let's wait a bit more. |
It should say on that link, at the bottom of the page, that it's waiting for a node.
|
EDIT: The use of a specific toolset for boost is now easily modifiable by the user through the user-config.jam file within the boost lib folder.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: