
gpu stanmathcl [WIP] #655

Closed
wants to merge 112 commits into from

Conversation

@SteveBronder (Collaborator) commented Oct 23, 2017

Submission Checklist

  • Run unit tests: ./runTests.py test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary:

Intended Effect:

How to Verify:

Side Effects:

Documentation:

Copyright and Licensing

Copyright:

Faculty of Computer and Information Science, University of Ljubljana.
Steve Bronder

Both parties agree to submit the Pull Request under the BSD 3 license

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@bob-carpenter (Contributor)

@SteveBronder Thanks!

Before we can merge, we need clear statements of copyright holder and licensing under BSD-3. Presumably that's you personally, but you need to add that to the pull request (there's a spot at the bottom of the form).

@SteveBronder (Collaborator, Author)

@bob-carpenter np!

The underlying algorithms are from @rok-cesnovar and his team. I believe we agreed it can all go under the BSD-3 license.

I'm going to look this over in the next few days, merge it into the gpu branch inside the stan math repo, and then we can start the actual review.

@rok-cesnovar (Member)

Yes, we agreed on BSD-3.

For the underlying algorithm the copyright holder is Faculty of Computer and Information Science, University of Ljubljana.

Thanks, Steve!

@rok-cesnovar (Member)

@SteveBronder I made some minor changes, all tests still pass.

Tomorrow I will do some performance analysis and post what to expect from all this.

@SteveBronder (Collaborator, Author)

@rok-cesnovar awesome!

@rok-cesnovar (Member) commented Oct 25, 2017

First, the results for the /prim/mat/fun cholesky_decompose_gpu compared to the CPU version.

Hardware used (both CPU and GPU are on the high-end side):
CPU: Core i7-4790 @ 3.60GHz
GPU: NVIDIA GTX1070

Summary:

  • the GPU version is faster than the CPU/Eigen version starting at matrices with M=200
  • at M~1000 the speedup is ~8-10
  • at M~2500 the speedup is ~14
  • at M~4000 the speedup is ~20

Timing and speedup plots: [image: time_chol_prim], [image: speedup]

I also plotted the GPU execution times without the time it takes to copy data to and from the GPU, so you can see what could be achieved if we found a way of leaving the data on the GPU for the next iteration/function (when that next iteration/function is also done on the GPU). As you can see, speedups of close to 100 could be achieved this way.

Here are the execution times:
https://www.dropbox.com/s/akndbsojte2z38t/timing_gpu_cholesky_prim.txt?dl=0

The /rev/mat/fun cholesky_decompose_gpu timings take a bit longer to run; I will post those once I have them.

@SteveBronder (Collaborator, Author)

That's awesome! Can you post the code you used to run the speed test?

@rok-cesnovar (Member)

@SteveBronder sure, I will make a repo and link it here. But it won't be until tomorrow, it is getting late in the CEST time zone :)

Here are the results for the /rev/mat/fun cholesky_decompose_gpu compared to the block CPU version that is currently used in rev/mat/fun cholesky_decompose. The CPU version of the algorithm we parallelized is slower than the block CPU version. We have already looked into making a GPU version of the block algorithm, but for now this should do.

Summary:

  • the GPU version is faster for matrices over M=50
  • at M~1000 the speedup is ~10
  • at M~3000 the speedup is ~20

Without copying the data to and from the GPU the speedups are around 50 for M larger than 1200.

Plots: [image: rev_time], [image: rev_speedup]

@SteveBronder (Collaborator, Author) commented Oct 26, 2017

@rok-cesnovar I updated the branch so it passes cpplint. I think our next steps are:

  • Documentation
  • Build tests for all the gpu functions
  • Figure out the best way for users to touch this (this is more of a Stan team thing which we have spoken about previously)
  • Discuss with the team the best way to review this
  • Request reviews

@syclik I'm unsure where some of the files should go and whether some of them should be broken up into multiple files.

EX1: check_gpu.hpp
EX2: basic_matrix_gpu.hpp
EX3: (file) kern_gpu

@rok-cesnovar (Member)

Agreed. I will start adding doxygen comments.

Regarding the tests: for some of them we can just reuse the CPU tests, but functions like transpose and copying of the lower triangular to the upper triangular probably do not have CPU equivalents.

@SteveBronder (Collaborator, Author)

Awesome I'll start building out the tests

@SteveBronder (Collaborator, Author)

@seantalts what is the level of coverage we need for the tests? For instance, does the zeros() function need a unit test?

@seantalts (Member)

This is a nuanced question. I think my own framework for thinking about it is that one needs tests that make sure all of the functionality you care about (or that someone might reasonably expect from the API) works correctly under a variety of inputs. So you could test that zeros works directly (easy enough to add a quick one) or you could test that functions that rely on it working correctly also work. @syclik might have more advice or a more stringent opinion here.

@syclik (Member) commented Oct 30, 2017

I agree with @seantalts that it's a nuanced question.

My soft requirement (or strong opinion) is that every function / method gets at least one test. I'm not looking for branch coverage or anything like that (although if the function is complicated, you should). Adding a simple test really cuts down on maintenance cost, even if it doesn't seem like it's doing much. Think about coming back to this function months from now. Just having the instantiation of A and part makes it really easy to track down a bug if you're triggering that cl::Error. Or it makes it easy to refactor the code because there's already some scaffolding in place to add more tests if that's what's required. Just think about how quick it would be for you to write the test now vs. coming back to this and having to figure out how to write the test with fresh eyes.

Of course, writing tests also helps you find design problems. If you can't write tests easily, then maybe there's a better design that should be employed.

@SteveBronder (Collaborator, Author)

@syclik that's reasonable

@SteveBronder (Collaborator, Author) commented Oct 30, 2017

Functions that need tests

  • matrix_gpu functions
  • copy matrix functions
  • Compile kernels in ocl_gpu
  • transpose (both src -> dst and in-place)
  • zeros
  • identity
  • copy_triangular
  • copy_submatrix
  • copy_triangular_transposed
  • check_square
  • add (tests seg fault for vector_d)
  • subtract (tests seg fault for vector_d)
  • lower_triangular_inverse
  • diagonal_multiply
  • multiply
  • check_diagonal_zeros

@rok-cesnovar (Member)

What should we do in cases where the operation is not thread safe, for instance multiply(A,A,A), which is essentially A <- A*A? Should we throw an exception or cover these cases with temporary buffers?

@seantalts (Member)

You mean that the algorithm you are using for multiplication can’t be used in-place, right? Is there one that can be? If not I’d say use a temporary buffer.

@bob-carpenter (Contributor)

@rok-cesnovar When would multiply(A, A, A) come up? It's not something that can be expressed as a Stan function because all the Stan functions copy on output.

@rok-cesnovar (Member) commented Nov 1, 2017

Yeah, the GPU function as written now uses the third parameter as the result matrix: multiply(A,B,C) is therefore C <- A * B. Rewriting it to copy on output also resolves the A <- A*A issue.

Thanks!

@rok-cesnovar (Member) commented Nov 13, 2017

I apologize for troubling you @bob-carpenter @seantalts @syclik

I have a question regarding the cholesky_decompose_test, specifically the test_chol_mult and the x_c matrix at line 68 here: https://github.com/stan-dev/math/blob/develop/test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp

It looks to me as if parts of the matrix are left uninitialized, which can cause issues from time to time. I noticed some weird behavior when running the tests, as the initial input to cholesky_decompose sometimes had NaN or Inf values. This happens before any of the newly added code is called.

I then ran a few tests on the develop branch (no GPU code at all) and noticed something weird; I would just like to double-check whether this is expected behavior or a possible bug.

These are some printed-out examples of x_c from the above-mentioned test. In all cases I ran test_chol_mult(25, 1e-08);

I first ran all the tests in cholesky_decompose_test.cpp and got this matrix: test_1.txt
The input was the same on each run.

I then commented out all the tests except for mat_cholesky_1st_deriv_large_gradients. I got this for the input matrix:
test_2.txt

I then swapped the test_gp_grad(1000, 1e-08) and test_gradient(36, 1e-08) tests and got this input matrix:
test_3.txt

Other examples:
test_4.txt
test_6.txt

Basically, every time I change any part of the tests or the underlying stan::math code, part of the input changes. In some extreme cases I got NaN/Inf values in the bottom-right part of the input matrix.
However, in all cases, the upper-left part of the matrix is the same.

Am I missing something or is this a bug in the test? Thank you for your time!

@seantalts (Member)

Oh thank god. You're right, that test fails from time to time (pretty rarely but I've seen it a few times this year). Want to make a separate PR to fix that?

@rok-cesnovar (Member)

Sure, but I am not sure I have enough knowledge of the actual tests to know how to fix this.

@seantalts (Member)

Which part is left uninitialized? Are you saying cov_matrix_constrain doesn't initialize all of the matrix?

@rok-cesnovar (Member) commented Nov 13, 2017

I think the problem is in the input x of cov_matrix_constrain. It looks like x is uninitialized.

@seantalts (Member)

It seems like x is passed in, right? So it should theoretically be initialized by stan::math::gradient and finite_diff_gradient on lines 288 and 294?

I do think there's something that isn't initialized...

@rok-cesnovar (Member)

I think I found it.

The problem is in the for loop at line 279.
test_vec.size() is smaller than vec_size

I changed the loop to this:

for (int i = 0; i < vec_size; ++i) {
  test_vals(i) = i % 10 / 100.0;
  if (i < test_vec.size())
    test_vec(i) = stan::math::normal_rng(0.0, 0.1, rng);
}

This fixes the problem.

@SteveBronder (Collaborator, Author)

@rok-cesnovar nice we are passing all the tests!

I have to write three more tests for the underlying cholesky decomposition kernels and fix up the doxygen so we pass the Jenkins build. But it looks like we are good! I'll do this tomorrow after work.

@rok-cesnovar (Member) commented Nov 20, 2017

Thanks Steve!

The underlying cholesky decomposition was rewritten a bit so that it reuses all the other GPU functions, except for a short kernel that calculates the cholesky for smaller blocks. That one is now the only kernel that needs a separate test. I will remove the remaining ones.

Would it be of any value, for the review process or in general, if I made a PDF with a short explanation of the idea behind the cholesky GPU implementation? That one and the inversion are more complex; the others are either standard GPU implementations (matrix multiply) or simple (transpose, zeros, identity, ...).

@syclik (Member) commented Nov 20, 2017 via email

@SteveBronder (Collaborator, Author) commented Dec 13, 2017

@rok-cesnovar On Jenkins we are still getting that OpenCL error for an invalid work group size

http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pull%20Request%20-%20Tests%20-%20Unit/949/console

I'm looking into whether we can use CL_DEVICE_MAX_WORK_GROUP_SIZE from this stackoverflow answer to set the work group size on a per-machine basis.

@rok-cesnovar (Member) commented Dec 13, 2017

Yeah, sure, we can use that. I will extract the device info when creating the context and then set the upper limits for the threads based on that info.

It seems that the CPU on which Jenkins runs this only supports 64 work-items per group. I will fix this tonight, thanks.

@SteveBronder (Collaborator, Author)

@syclik How do I interpret the current errors in Jenkins?

http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/

Looking at the console output and the content of the test (I think?), I'm not really sure how to interpret the errors.

@bob-carpenter (Contributor) commented Dec 28, 2017 via email

@SteveBronder (Collaborator, Author)

Thanks, I'll look at the console in the future. Should I have Jenkins retest this, or should someone else look into what is going on?

For the record, I am expecting an error for invalid work group sizes, like this console output from a previous run.

@seantalts (Member)

I also like the pipeline view: http://d1m1s1b1.stat.columbia.edu:8080/job/Math%20Pipeline/job/PR-655/2/flowGraphTable/

In the Unit phase I see a lot of this message:

clang: error: -framework OpenCL: 'linker' input unused

@SteveBronder SteveBronder changed the title Gpu stanmathcl Gpu stanmathcl [WIP] Jan 20, 2018
@SteveBronder SteveBronder changed the title Gpu stanmathcl [WIP] gpu stanmathcl [WIP] Feb 2, 2018
@syclik syclik added the wip label Feb 14, 2018
@syclik (Member) commented Feb 14, 2018

Just recapping status: we are waiting on other pull requests.

@syclik (Member) commented Feb 23, 2018

@SteveBronder, mind if I close this PR for now? We can reopen once we have the other pieces in. (Just trying to keep a handle on open Math PRs.)

@SteveBronder (Collaborator, Author)

Totally, will do now

@seantalts (Member)

It'd be nice if there were a way to keep the discussion for the GPU (or MPI) topic in fewer places, though I suppose at some point that discussion needs to be summarized and documented with the code or a new PR. I'm always surprised when I'm following a PR and someone mentions that there's a corresponding thread on discourse with some of the same and some new information.

@syclik (Member) commented Feb 23, 2018

+1.

@syclik (Member) commented Feb 23, 2018

@seantalts: if you want to be proactive about it, want to come up with a draft of guidelines that we can discuss / encourage others to follow?

It used to be: discuss the goals and design on the stan-dev email list, put down a concrete issue on GitHub with clarification on the issue (but no real discussion), then the pull request was for code-level comments. But that was just a social norm and never written down.

As we've gotten new contributors and different tooling (discourse + GitHub adding templates, projects, and making it easier to merge across forks), we haven't been able to encourage putting the discussions in the right place. Putting up guidelines that we can point people to would be a good start.

@syclik (Member) commented Feb 23, 2018

If that's enough of a description, I can just put that on a wiki and update the Guidelines (in the repo).

@bob-carpenter (Contributor) commented Feb 23, 2018 via email

@SteveBronder (Collaborator, Author)

From an outside perspective it's also hard to decide where to post about things. E.g., I made this discourse thread here about system errors, but thinking about it now, that should have been posted on the PR.

I think it could be something like

  1. Proposal on discourse
  2. File detailed proposal as a github issue (possibly with MVP included)
  3. Make a fork/branch with first draft and submit PR
  4. Discuss code and CI issues on PR and design issues on discourse (?)
  5. Iterate until ready for review
  6. ???
  7. Merged!

@seantalts (Member)

So there could be a few ways to slice and dice what goes where. First, though, I think a mechanical rule that is always followed (and doesn't require judgment calls) could go a long way:

  1. If it comes to your attention that there are discourse threads discussing something relevant to a PR or issue, put all the links in the top, original issue or PR post (the one with the copyright information, etc.) so that it's all collected in one place. I think doing the same thing on discourse (adding all relevant github links to the original post) is probably also good.

Thoughts on this rule?

As far as what-goes-where, I see discourse more as a forum for discussion that we tend to assume has a larger audience than a github PR. So open-ended design discussions, community-wide alerts, that kind of stuff can go there. I'd keep code-specific stuff in GitHub, as Daniel was saying, as long as it isn't an issue that requires a wider or longer discussion.

I think it'd be good to write this up and put it on a wiki somewhere, but I don't currently understand how our wikis are organized, and I'm not really sure new people would understand even which repo to look at for wikis, never mind which ones are relevant to them. A separate problem, haha.

@rok-cesnovar rok-cesnovar deleted the gpu_stanmathcl branch March 14, 2019 19:57