gpu stanmathcl [WIP] #655
Conversation
@SteveBronder Thanks! Before we can merge, we need clear statements of copyright holder and licensing under BSD-3. Presumably that's you personally, but you need to add that to the pull request (there's a spot at the bottom of the form).
@bob-carpenter np! The underlying algorithms are from @rok-cesnovar and his team. I believe we agreed it can all go under the BSD-3 license. I'm going to look this over in the next few days, merge this into the gpu branch inside of the stan math repo, and then we can start the actual review.
Yes, we agreed on BSD-3. For the underlying algorithm the copyright holder is the Faculty of Computer and Information Science, University of Ljubljana. Thanks, Steve!
@SteveBronder I made some minor changes; all tests still pass. Tomorrow I will do some performance analysis and post what to expect from all this.
@rok-cesnovar awesome!
First, the results for the /prim/mat/fun cholesky_decompose_gpu compared to the CPU version. Hardware used (both CPU and GPU are on the high-end side): Summary:
I also plot the GPU execution times without the time it takes to copy data to and from the GPU, so you can see what could be achieved if we found a way of leaving the data on the GPU for the next iteration/function, if that next iteration/function is also done on the GPU. As you can see, speedups of close to 100 could be achieved this way. Here are the execution times: The /rev/mat/fun cholesky_decompose_gpu timings take a bit more time; I will post those once I have them.
That's awesome! Can you post the code you used to run the speed test?
@SteveBronder sure, I will make a repo and link it here. But it won't be until tomorrow; it is getting late in the CEST time zone :) Here are the results for the /rev/mat/fun cholesky_decompose_gpu compared to the block CPU version that is currently used in rev/mat/fun cholesky_decompose. The CPU version of the algorithm we parallelized is slower than the block CPU version. We have already looked into making a GPU version of the block algorithm, but for now this should do. Summary:
Without copying the data to and from the GPU, the speedups are around 50 for M larger than 1200.
@rok-cesnovar I updated the branch so it passes cpplint. I think our next steps are:
@syclik I'm unsure where some of the files should go and whether some files should be broken up into multiple files. EX1: check_gpu.hpp
Agreed. I will start adding doxygen comments. Regarding the tests: for some of them we can just reuse the CPU tests, but functions like transpose and copying of the lower triangular part to the upper triangular part probably do not have an equivalent.
Awesome, I'll start building out the tests.
@seantalts what is the level of coverage we need for the tests? For instance, does the
This is a nuanced question. I think my own framework for thinking about it is that one needs tests that make sure all of the functionality you care about (or that someone might reasonably expect from the API) works correctly under a variety of inputs. So you could test that
I agree with @seantalts that it's a nuanced question. My soft requirement (or strong opinion) is that every function / method gets at least one test. I'm not looking for branch coverage or anything like that (although if the function is complicated, you should). Adding a simple test really cuts down on maintenance cost, even if it doesn't seem like it's doing much. Think about coming back to this function months from now. Just having an instantiation of the function exercised in a test helps. Of course, writing tests also helps you fix design problems. If you can't write tests easily, then maybe there's a better design that should be employed.
@syclik that's reasonable.
Functions that need tests
What should we do in cases where the operation is not thread safe, for instance multiply(A,A,A), which is essentially A <- A*A? Should we throw an exception or cover these cases with temporary buffers?
You mean that the algorithm you are using for multiplication can't be used in-place, right? Is there one that can be? If not, I'd say use a temporary buffer.
@rok-cesnovar When would
Yeah, the GPU function, as it is written now, uses the third parameter as the result matrix; multiply(A,B,C) is therefore C <- A * B. Rewriting this to copy on output also resolves the A <- A*A issue. Thanks!
I apologize for troubling you @bob-carpenter @seantalts @syclik. I have a question regarding the cholesky_decompose_test, specifically test_chol_mult and the x_c matrix at line 68 here: https://github.com/stan-dev/math/blob/develop/test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp It looks to me as if parts of the matrix are left uninitialized, and that can cause some issues from time to time. I noticed some weird behavior when running the tests, as the initial input to the cholesky_decompose sometimes had NaN or Inf values. This is before any of the newly added code is called. I then ran a few tests on the develop branch (no GPU code at all), noticed something weird, and would just like to double check whether this is the expected behavior or a possible bug. These are some printed-out examples of x_c from the above-mentioned test. In all cases I ran test_chol_mult(25, 1e-08); I first ran all the tests in cholesky_decompose_test.cpp and got this matrix: test_1.txt I then commented out all the tests except for mat_cholesky_1st_deriv_large_gradients and got this for the input matrix. I then swapped the test_gp_grad(1000, 1e-08) and test_gradient(36, 1e-08) tests and got this input matrix. Other examples: Basically, every time I change any part of the tests or the underlying stan::math code, part of the input changes. In some extreme cases I got NaN/Inf values in the bottom-right part of the input matrix. Am I missing something, or is this a bug in the test? Thank you for your time!
Oh thank god. You're right, that test fails from time to time (pretty rarely, but I've seen it a few times this year). Want to make a separate PR to fix that?
Sure, but I am not sure I have enough knowledge of the actual tests to know how to fix this.
Which part is left uninitialized? Are you saying
I think the problem is in the input x of the cov_matrix_constrain. It looks like x is uninitialized.
It seems like x is passed in, right? So it should theoretically be initialized by … I do think there's something that isn't initialized...
I think I found it. The problem is in the for loop at line 279. I changed the loop to this:
This fixes the problem.
@rok-cesnovar nice, we are passing all the tests! I have to write three more tests for the underlying cholesky decomposition kernels and fix up the doxygen so we pass the jenkins build. But looks like we are good! I'll do this tomorrow after work.
Thanks Steve! The underlying cholesky decomposition was rewritten a bit so that it reuses all the other GPU functions, except for a short kernel that calculates the cholesky for smaller blocks. So that one is the only one that needs a separate test now. I will remove the remaining ones. Would it be of any value for the review process, or in general, if I made a PDF with a short explanation of the idea behind the cholesky GPU implementation? That one and the inversion are more complex; the others are either standard GPU implementations (matrix multiply) or simple (transpose, zeros, identity, ...).
Yes to the PDF! Doc is helpful; especially motivation for design. Aim it at people that are in between novice and expert and it'll come in handy for the future too.
Thanks!
@rok-cesnovar On Jenkins we are still getting that OpenCL error for an invalid work group size http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pull%20Request%20-%20Tests%20-%20Unit/949/console I'm looking into whether we can use
Yeah, sure we can use that. I will extract the device info when creating the context and then just set the upper limits for the threads based on that info. It seems that the CPU on which Jenkins runs this only supports 64 work-items per group. I will fix this tonight, thanks.
@syclik How do I interpret the current errors in Jenkins? http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/ Looking at the console output and the content of the test (I think?), I'm not really sure how to interpret the errors.
I find the console output easiest to digest: http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/console Looks like a bunch of failed unit tests for some reason that seems unrelated to anything you'd be doing with GPUs.
Thanks, I'll look at the console in the future. Should I have Jenkins retest this, or should someone else look into what is going on? FTR, I am expecting an error for invalid work group sizes a la this console output from a previous run.
I also like the pipeline view: http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/flowGraphTable/ In the Unit phase I see a lot of this message:
Force-pushed from 1907a02 to b2be485.
Just recapping status: we are waiting on other pull requests. |
@SteveBronder, mind if I close this PR for now? We can reopen once we have the other pieces in. (Just trying to keep a handle on open Math PRs.) |
Totally, will do now |
It'd be nice if there was a way to keep the discussion for the GPU (or MPI) topic in fewer places, though I suppose at some point that discussion needs to be summarized and documented with the code or new PR. I'm always surprised when I'm following a PR and someone mentions that there's a corresponding thread on discourse with some of the same and some new information. |
+1. |
@seantalts: if you want to be proactive about it, want to come up with a draft of guidelines that we can discuss / encourage others to follow? It used to be: discuss the goals and design on the stan-dev email list, put down a concrete issue on GitHub with clarification on the issue (but no real discussion), and then the pull request was for code-level comments. But that was just a social norm and never written down. As we've gotten new contributors and different tooling (discourse + GitHub adding templates, projects, and making it easier to merge across forks), we haven't been able to encourage putting the discussions in the right place. Putting up guidelines that we can point people to would be a good start.
If that's enough of a description, I can just put that on a wiki and update the Guidelines (in the repo). |
What do you recommend to keep discussion in one place? Start a thread on discourse to discuss one-location discussions and link from here? :-)
From the outside perspective it's also hard to decide where to post about things, i.e. I made this discourse thread here about system errors, but thinking about it now, it should have been posted on the PR. I think it could be something like
So there could be a few ways to slice and dice what goes where. First, though, I think it could be helpful to have a mechanical rule that is always followed (and doesn't require judgment calls); with that we could go a long way:
Thoughts on this rule? As far as what-goes-where, I see discourse more as a forum for discussion that we tend to assume has a larger audience than a GitHub PR. So open-ended design discussions, community-wide alerts, that kind of stuff can go there. I'd keep code-specific stuff in GitHub, as Daniel was saying, as long as it isn't an issue that requires a wider or longer discussion. I think it'd be good to write this up and put it on a wiki somewhere, but I don't currently understand how our wikis are organized, and I'm not really sure new people would understand even which repo to look at for wikis, never mind which ones are relevant to them. A separate problem, haha.
Submission Checklist
./runTests.py test/unit
make cpplint
Summary:
Intended Effect:
How to Verify:
Side Effects:
Documentation:
Copyright and Licensing
Copyright:
Faculty of Computer and Information Science, University of Ljubljana.
Steve Bronder
Both parties agree to submit the pull request under the BSD-3 license.
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: