
gpu stanmathcl [WIP] #655

Closed
wants to merge 112 commits into from

Conversation

@SteveBronder (Collaborator) commented Oct 23, 2017

Submission Checklist

  • Run unit tests: ./runTests.py test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary:

Intended Effect:

How to Verify:

Side Effects:

Documentation:

Copyright and Licensing

Copyright:

Faculty of Computer and Information Science, University of Ljubljana.
Steve Bronder

Both parties agree to submit the Pull Request under the BSD 3 license

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@bob-carpenter (Contributor)

@SteveBronder Thanks!

Before we can merge, we need clear statements of copyright holder and licensing under BSD-3. Presumably that's you personally, but you need to add that to the pull request (there's a spot at the bottom of the form).

@SteveBronder (Collaborator, Author)

@bob-carpenter np!

The underlying algorithms are from @rok-cesnovar and his team. I believe we agreed it can all go under the BSD-3 license.

I'm going to look this over in the next few days, merge it into the gpu branch inside the stan math repo, and then we can start the actual review.

@rok-cesnovar (Member)

Yes, we agreed on BSD-3.

For the underlying algorithm the copyright holder is Faculty of Computer and Information Science, University of Ljubljana.

Thanks, Steve!

@rok-cesnovar (Member)

@SteveBronder I made some minor changes, all tests still pass.

Tomorrow I will do some performance analysis and post what to expect from all this.

@SteveBronder (Collaborator, Author)

@rok-cesnovar awesome!

@rok-cesnovar (Member) commented Oct 25, 2017

First, the results for the /prim/mat/fun cholesky_decompose_gpu compared to the CPU version.

Hardware used (both CPU and GPU are on the high-end side):
CPU: Core i7-4790 @ 3.60GHz
GPU: NVIDIA GTX1070

Summary:

  • the GPU version is faster than the CPU/Eigen version starting at matrices with M=200
  • at M~1000 the speedup is ~8-10
  • at M~2500 the speedup is ~14
  • at M~4000 the speedup is ~20

Timing and speedup plots: [image: time_chol_prim], [image: speedup]

I also plotted the GPU execution times without the time it takes to copy data to and from the GPU, so you can see what could be achieved if we found a way of leaving the data on the GPU for the next iteration/function (when that next iteration/function is also done on the GPU). As you can see, speedups of close to 100 could be achieved this way.

Here are the execution times:
https://www.dropbox.com/s/akndbsojte2z38t/timing_gpu_cholesky_prim.txt?dl=0

The /rev/mat/fun cholesky_decompose_gpu timings take a bit longer to run; I will post those once I have them.

@SteveBronder (Collaborator, Author)

That's awesome! Can you post the code you used to run the speed test?

@rok-cesnovar (Member)

@SteveBronder sure, I will make a repo and link it here. But it won't be until tomorrow, it is getting late in the CEST time zone :)

Here are the results for the /rev/mat/fun cholesky_decompose_gpu compared to the block CPU version that is currently used in rev/mat/fun cholesky_decompose. The CPU version of the algorithm we parallelized is slower than the block CPU version. We have already looked into making a GPU version of the block algorithm, but for now this should do.

Summary:

  • the GPU version is faster for matrices over M=50
  • at M~1000 the speedup is ~10
  • at M~3000 the speedup is ~20

Without copying the data to and from the GPU the speedups are around 50 for M larger than 1200.

Plots: [image: rev_time], [image: rev_speedup]

@SteveBronder (Collaborator, Author) commented Oct 26, 2017

@rok-cesnovar I updated the branch so it passes cpplint. I think our next steps are:

  • Documentation
  • Build tests for all the gpu functions
  • Figure out the best way for users to touch this (this is more of a Stan team thing which we have spoken about previously)
  • Discuss with the team the best way to review this
  • Request reviews

@syclik I'm unsure where some of the files should go and whether some of them should be broken up into multiple files.

EX1: check_gpu.hpp
EX2: basic_matrix_gpu.hpp
EX3: (file) kern_gpu

@rok-cesnovar (Member)

Agreed. I will start adding doxygen comments.

Regarding the tests: for some of them we can just reuse the CPU tests, but functions like transpose and copying of the lower triangular to the upper triangular probably do not have CPU equivalents.

@SteveBronder (Collaborator, Author)

Awesome I'll start building out the tests

@SteveBronder (Collaborator, Author)

@seantalts what is the level of coverage we need for the tests? For instance, does the zeros() function need a unit test?

@seantalts (Member)

This is a nuanced question. I think my own framework for thinking about it is that one needs tests that make sure all of the functionality you care about (or that someone might reasonably expect from the API) works correctly under a variety of inputs. So you could test that zeros works directly (easy enough to add a quick one) or you could test that functions that rely on it working correctly also work. @syclik might have more advice or a more stringent opinion here.

@syclik (Member) commented Oct 30, 2017

I agree with @seantalts that it's a nuanced question.

My soft requirement (or strong opinion) is that every function / method gets at least one test. I'm not looking for branch coverage or anything like that (although if the function is complicated, you should). Adding a simple test really cuts down on maintenance cost, even if it doesn't seem like it's doing much. Think about coming back to this function months from now. Just having the instantiation of A and part makes it really easy to track down a bug if you're triggering that cl::Error. Or it makes it easy to refactor the code because there's already some scaffolding in place to add more tests if that's what's required. Just think about how quick it would be for you to write the test now vs. coming back to this and having to figure out how to write the test with fresh eyes.

Of course, writing tests also helps you find design problems. If you can't write tests easily, then maybe there's a better design that should be employed.

@SteveBronder (Collaborator, Author)

@syclik that's reasonable

@SteveBronder (Collaborator, Author) commented Oct 30, 2017

Functions that need tests

  • matrix_gpu functions
  • copy matrix functions
  • Compile kernels in ocl_gpu
  • transpose (both src -> dst and in-place)
  • zeros
  • identity
  • copy_triangular
  • copy_submatrix
  • copy_triangular_transposed
  • check_square
  • add (tests seg fault for vector_d)
  • subtract (tests seg fault for vector_d)
  • lower_triangular_inverse
  • diagonal_multiply
  • multiply
  • check_diagonal_zeros

@rok-cesnovar (Member)

What should we do in cases where the operation is not thread safe, for instance multiply(A,A,A), which is essentially A <- A*A? Should we throw an exception or cover these cases with temporary buffers?

@seantalts (Member)

You mean that the algorithm you are using for multiplication can’t be used in-place, right? Is there one that can be? If not I’d say use a temporary buffer.

@bob-carpenter (Contributor)

@rok-cesnovar When would multiply(A, A, A) come up? It's not something that can be expressed as a Stan function because all the Stan functions copy on output.

@rok-cesnovar (Member) commented Nov 1, 2017

Yeah, the GPU function as written now uses the third parameter as the result matrix: multiply(A,B,C) is therefore C <- A * B. Rewriting it to copy on output also resolves the A <- A*A issue.

Thanks!

@rok-cesnovar (Member) commented Nov 13, 2017

I apologize for troubling you @bob-carpenter @seantalts @syclik

I have a question regarding the cholesky_decompose_test, specifically the test_chol_mult and the x_c matrix at line 68 here: https://github.com/stan-dev/math/blob/develop/test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp

It looks to me as if parts of the matrix are left uninitialized, which can cause issues from time to time. I noticed some weird behavior when running the tests, as the initial input to cholesky_decompose sometimes had NaN or Inf values. This happens before any of the newly added code is called.

I then ran a few tests on the develop branch (no GPU code at all) and noticed something weird; I would just like to double-check whether this is expected behavior or a possible bug.

These are some printed-out examples of x_c from the above-mentioned test. In all cases I ran test_chol_mult(25, 1e-08);

I first ran all the tests in cholesky_decompose_test.cpp and got this matrix: test_1.txt
The input was the same on each run.

I then commented out all the tests except for mat_cholesky_1st_deriv_large_gradients. I got this for the input matrix:
test_2.txt

I then swapped the test_gp_grad(1000, 1e-08) and test_gradient(36, 1e-08) tests and got this input matrix:
test_3.txt

Other examples:
test_4.txt
test_6.txt

Basically, every time I change any part of the tests or the underlying stan::math code, part of the input changes. In some extreme cases I got NaN/Inf values in the bottom-right part of the input matrix.
However, in all cases, the upper-left part of the matrix is the same.

Am I missing something or is this a bug in the test? Thank you for your time!

@seantalts (Member)

Oh thank god. You're right, that test fails from time to time (pretty rarely but I've seen it a few times this year). Want to make a separate PR to fix that?

@rok-cesnovar (Member)

Sure, but I am not sure I have enough knowledge of the actual tests to know how to fix this.

@seantalts (Member)

Which part is left uninitialized? Are you saying cov_matrix_constrain doesn't initialize all of the matrix?

@rok-cesnovar (Member) commented Nov 13, 2017

I think the problem is in the input x of cov_matrix_constrain. It looks like x is uninitialized.

@seantalts (Member)

It seems like x is passed in, right? So it should theoretically be initialized by stan::math::gradient and finite_diff_gradient on lines 288 and 294?

I do think there's something that isn't initialized...

@rok-cesnovar (Member)

I think I found it.

The problem is in the for loop at line 279.
test_vec.size() is smaller than vec_size

I changed the loop to this:

for (int i = 0; i < vec_size; ++i) {
  test_vals(i) = i % 10 / 100.0;
  if (i < test_vec.size())
    test_vec(i) = stan::math::normal_rng(0.0, 0.1, rng);
}

This fixes the problem.

@SteveBronder (Collaborator, Author)

@rok-cesnovar nice we are passing all the tests!

I have to write three more tests for the underlying cholesky decomposition kernels and fix up the doxygen so we pass the Jenkins build. But it looks like we are good! I'll do this tomorrow after work.

@rok-cesnovar (Member) commented Nov 20, 2017

Thanks Steve!

The underlying cholesky decomposition was rewritten a bit so that it reuses all the other GPU functions, except for a short kernel that calculates the cholesky for smaller blocks. That one is now the only kernel that needs a separate test. I will remove the remaining ones.

Would it be of any value, for the review process or in general, if I made a PDF with a short explanation of the idea behind the cholesky GPU implementation? That one and the inversion are more complex; the others are either standard GPU implementations (matrix multiply) or simple (transpose, zeros, identity, ...).

@syclik (Member) commented Nov 20, 2017 via email

@SteveBronder (Collaborator, Author) commented Dec 13, 2017

@rok-cesnovar On Jenkins we are still getting that OpenCL error for an invalid work group size

http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pull%20Request%20-%20Tests%20-%20Unit/949/console

I'm looking into whether we can use CL_DEVICE_MAX_WORK_GROUP_SIZE from this stackoverflow answer to set the work group size on a per-machine basis.

@rok-cesnovar (Member) commented Dec 13, 2017

Yeah, sure, we can use that. I will extract the device info when creating the context and then set the upper limits for the threads based on that info.

It seems that the CPU on which Jenkins runs this only supports 64 work-items per group. I will fix this tonight, thanks.

@SteveBronder (Collaborator, Author)

@syclik How do I interpret the current errors in Jenkins?

http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/

Looking at the console output and the content of the test (I think?), I'm not really sure how to interpret the errors.

@bob-carpenter (Contributor) commented Dec 28, 2017 via email

@SteveBronder (Collaborator, Author)

Thanks, I'll look at the console in the future. Should I have Jenkins retest this, or should someone else look into what is going on?

For the record, I am expecting an error for invalid work group sizes, like this console output from a previous run.

@seantalts (Member)

I also like the pipeline view: http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/flowGraphTable/

In the Unit phase I see a lot of this message:

clang: error: -framework OpenCL: 'linker' input unused

@SteveBronder SteveBronder changed the title Gpu stanmathcl Gpu stanmathcl [WIP] Jan 20, 2018
@SteveBronder SteveBronder changed the title Gpu stanmathcl [WIP] gpu stanmathcl [WIP] Feb 2, 2018
@syclik syclik added the wip label Feb 14, 2018
@syclik (Member) commented Feb 14, 2018

Just recapping status: we are waiting on other pull requests.

@syclik (Member) commented Feb 23, 2018

@SteveBronder, mind if I close this PR for now? We can reopen once we have the other pieces in. (Just trying to keep a handle on open Math PRs.)

@SteveBronder (Collaborator, Author)

Totally, will do now

@seantalts (Member)

It'd be nice if there were a way to keep the discussion for the GPU (or MPI) topic in fewer places, though I suppose at some point that discussion needs to be summarized and documented with the code or a new PR. I'm always surprised when I'm following a PR and someone mentions that there's a corresponding thread on discourse with some of the same and some new information.

@syclik (Member) commented Feb 23, 2018

+1.

@syclik (Member) commented Feb 23, 2018

@seantalts: if you want to be proactive about it, want to come up with a draft of guidelines that we can discuss / encourage others to follow?

It used to be: discuss the goals and design on the stan-dev email list, put down a concrete issue on GitHub with clarification on the issue (but no real discussion), then the pull request was for code-level comments. But that was just a social norm and never written down.

As we've gotten new contributors and different tooling (discourse + GitHub adding templates, projects, and making it easier to merge across forks), we haven't been able to encourage putting the discussions in the right place. Putting up guidelines that we can point people to would be a good start.

@syclik (Member) commented Feb 23, 2018

If that's enough of a description, I can just put that on a wiki and update the Guidelines (in the repo).

@bob-carpenter (Contributor) commented Feb 23, 2018 via email

@SteveBronder (Collaborator, Author)

From an outside perspective it's also hard to decide where to post about things. E.g., I made this discourse thread here about system errors, but thinking about it now, that should have been posted on the PR.

I think it could be something like

  1. Proposal on discourse
  2. File detailed proposal as a github issue (possibly with MVP included)
  3. Make a fork/branch with first draft and submit PR
  4. Discuss code and CI issues on PR and design issues on discourse (?)
  5. Iterate until ready for review
  6. ???
  7. Merged!

@seantalts (Member)

So there could be a few ways to slice and dice what goes where. First, though, I think a mechanical rule that is always followed (and doesn't require judgment calls) could go a long way:

  1. If it comes to your attention that there are discourse threads discussing something relevant to a PR or issue, put all the links in the top, original issue or PR post (the one with the copyright information, etc.) so that it's all collected in one place. I think doing the same thing on discourse (adding all relevant github links to the original post) is probably also good.

Thoughts on this rule?

As far as what-goes-where, I see discourse more as a forum for discussion that we tend to assume has a larger audience than a github PR. So open-ended design discussions, community-wide alerts, that kind of stuff can go there. I'd keep code-specific stuff in GitHub, as Daniel was saying, as long as it isn't an issue that requires a wider or longer discussion.

I think it'd be good to write this up and put it on a wiki somewhere, but I don't currently understand how our wikis are organized, and I'm not really sure new people would understand even which repo to look at for wikis, never mind which ones are relevant to them. A separate problem, haha.

@rok-cesnovar rok-cesnovar deleted the gpu_stanmathcl branch March 14, 2019 19:57