
Rev for Cholesky on GPU #1118

Merged: 63 commits, Mar 2, 2019

Changes from 4 commits

Commits:
15be4dd
Moved over main rev function to branch and did simple cleaning
SteveBronder Dec 20, 2018
0196022
Moves Cholesky GPU into the cholesky file
SteveBronder Dec 26, 2018
ff32352
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
SteveBronder Dec 26, 2018
463fb32
Merge branch 'gpu-cholesky-rev' of https://github.com/bstatcomp/math …
SteveBronder Feb 3, 2019
4f156f4
Update rev cholosky to use new gpu functions
SteveBronder Feb 3, 2019
37d5b18
Fixup rev code
SteveBronder Feb 4, 2019
0d88984
update rev
SteveBronder Feb 4, 2019
ac03052
reverting the copy_tri kernel changes
rok-cesnovar Feb 4, 2019
a77afda
include header remove
rok-cesnovar Feb 4, 2019
66799ba
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 4, 2019
b87dffa
minor comments
rok-cesnovar Feb 4, 2019
37075df
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 9, 2019
2718f05
added rev opencl tests, failing
rok-cesnovar Feb 9, 2019
af7f16a
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 9, 2019
69e7d9c
further work on rev, still fails
rok-cesnovar Feb 9, 2019
5f0cc21
Update tests, now passing
SteveBronder Feb 10, 2019
1e27c13
missing ifdef in rev, caused fails without STAN_OPENCL
rok-cesnovar Feb 10, 2019
6105b19
fixed the multiply(Nx0, 0xM) bug
rok-cesnovar Feb 10, 2019
0363c08
removed the size>0 check
rok-cesnovar Feb 10, 2019
c56c5d7
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 10, 2019
cd65c9b
sets the tests to go off on the GPU for smaller cholesky sizes
SteveBronder Feb 11, 2019
eb33397
Merge remote-tracking branch 'origin/gpu_cholesky_prim' into gpu-chol…
SteveBronder Feb 12, 2019
b95beba
Merge commit 'a933f65c4f584fb2ff949b350816eae59b72ddfd' into HEAD
yashikno Feb 12, 2019
0bf89b4
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 12, 2019
a3c1650
Merge remote-tracking branch 'upstream/develop' into gpu-cholesky-rev
SteveBronder Feb 12, 2019
a82d581
Merge branch 'gpu-cholesky-rev' of https://github.com/bstatcomp/math …
SteveBronder Feb 12, 2019
96b2a40
Accidentally took out some tests that were supposed to be there
SteveBronder Feb 12, 2019
f2bb7c4
fix some simple math
SteveBronder Feb 12, 2019
8aa02ee
Adds bug fix so that GPU cholesky only works on type double. Also mak…
SteveBronder Feb 17, 2019
8aa1c10
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 18, 2019
c02c813
Merge branch 'develop' into gpu_cholesky_prim
rok-cesnovar Feb 18, 2019
4aa3029
Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev
rok-cesnovar Feb 18, 2019
9df78a4
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 18, 2019
316ba55
Revert "Merge branch 'gpu_cholesky_prim' into gpu-cholesky-rev"
rok-cesnovar Feb 18, 2019
d8b476a
addressing minor comments
rok-cesnovar Feb 18, 2019
decbb72
For cholesky_opencl in rev, Removes the M = M_, sets the inner loops …
SteveBronder Feb 18, 2019
705ef7e
merge to remote
SteveBronder Feb 18, 2019
9846f74
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 18, 2019
59ffbf9
merge to develop
SteveBronder Feb 20, 2019
7197dba
remove duplicate include
rok-cesnovar Feb 20, 2019
32560f5
Merge commit 'f0e38e580d1d34d91fb202b0134d76e9d92a5226' into HEAD
yashikno Feb 19, 2019
c0aeaf6
Merge remote-tracking branch 'origin/develop' into feature/issue-1058…
SteveBronder Feb 17, 2019
bd7b13b
Cleans up L<var> assignment in cholesky_decompose. Removes unneeded g…
SteveBronder Feb 21, 2019
671901b
Merge branch 'gpu-cholesky-rev' of https://github.com/bstatcomp/math …
SteveBronder Feb 21, 2019
2514def
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 21, 2019
e8a4e48
Fixes the math in the block size calculation for cholesky_block and m…
SteveBronder Feb 24, 2019
59aab65
merge to origin remote
SteveBronder Feb 24, 2019
483ce25
Merge commit '2850ec262181075a5843968e805dc9ad1654e069' into HEAD
yashikno Feb 24, 2019
95adc28
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 24, 2019
b92441a
Removes comment that was not truthful about not doing pos_def checks
SteveBronder Feb 24, 2019
bef42c7
check_symmetric with OpenCL in prim
rok-cesnovar Feb 24, 2019
0e2ef93
check_symmetric removed in rev for the OpenCL version, its a duplicat…
rok-cesnovar Feb 24, 2019
1e09d95
Merge remote-tracking branch 'upstream/develop' into gpu-cholesky-rev
SteveBronder Feb 27, 2019
9037b8d
removing camelCase and unwanted underscores
rok-cesnovar Feb 27, 2019
5c2e48d
added 2 tests below the threshold
rok-cesnovar Feb 27, 2019
c45bac0
bar goes adj
rok-cesnovar Feb 27, 2019
5117236
symbolic_rev added, better zeroing
rok-cesnovar Feb 27, 2019
1d17106
Merge commit 'c72d56f4d29d2cc5d3535ee433d7a9695fb681fc' into HEAD
yashikno Feb 27, 2019
d4c8646
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 27, 2019
d6d7fa0
Merge branch 'gpu-cholesky-rev' of https://github.com/bstatcomp/math …
SteveBronder Feb 27, 2019
d0c124f
Merge remote-tracking branch 'upstream/develop' into gpu-cholesky-rev
SteveBronder Mar 1, 2019
9784cc2
outsource to prim cholesky directly
rok-cesnovar Mar 1, 2019
0db6764
[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…
stan-buildbot Mar 1, 2019
204 changes: 202 additions & 2 deletions stan/math/rev/mat/fun/cholesky_decompose.hpp
@@ -11,6 +11,14 @@
#include <stan/math/prim/mat/err/check_pos_definite.hpp>
#include <stan/math/prim/mat/err/check_square.hpp>
#include <stan/math/prim/mat/err/check_symmetric.hpp>
#include <stan/math/gpu/cholesky_decompose.hpp>
#include <stan/math/gpu/diagonal_multiply.hpp>
#include <stan/math/gpu/inverse.hpp>
#include <stan/math/gpu/matrix.hpp>
#include <stan/math/gpu/multiply.hpp>
#include <stan/math/gpu/err/check_square.hpp>
#include <stan/math/gpu/err/check_symmetric.hpp>

#include <algorithm>

namespace stan {
@@ -224,6 +232,171 @@ class cholesky_scalar : public vari {
}
};

class cholesky_gpu : public vari {
public:
int M_;
vari** variRefA_;
Member:

We use the naming rules from Google's style guide: https://google.github.io/styleguide/cppguide.html#General_Naming_Rules and https://google.github.io/styleguide/cppguide.html#Variable_Names

In particular, we don't use mixed case for variable names, preferring all lower case with underscores between words.

Member:

variRefA_ and variRefL_ were replaced. There are similar cases in some other files; should I make a "cleanup mixed case" issue?

Member:

Ah, that'd be great, thanks! In retrospect I probably should have mentioned this earlier, but I felt like we already had a lot to deal with that was more important. Now things are all really good, so we're getting into marginal returns 😂

vari** variRefL_;

/**
* Constructor for the GPU Cholesky function.
*
* Stores varis for A. Instantiates and stores varis for L.
* Instantiates and stores a dummy vari for the upper triangular part of
* the var result returned by the cholesky_decompose function call.
*
* The varis in variRefL are not on the chainable autodiff stack; they are
* only used for storage and computation. Note that the varis for L are
* constructed externally in cholesky_decompose.
*
* @param A matrix to factor
* @param L_A matrix, the Cholesky factor of A
*/
cholesky_gpu(const Eigen::Matrix<var, -1, -1>& A,
Member:

Can we switch entirely to "opencl" from "gpu"?

Collaborator (author):

Yes, totally. After this PR I'm also going to make another PR changing all the matrix_gpus to matrix_cl and renaming the folder.

const Eigen::Matrix<double, -1, -1>& L_A)
: vari(0.0),
M_(A.rows()),
variRefA_(ChainableStack::instance().memalloc_.alloc_array<vari*>(
A.rows() * (A.rows() + 1) / 2)),
variRefL_(ChainableStack::instance().memalloc_.alloc_array<vari*>(
A.rows() * (A.rows() + 1) / 2)) {
size_t pos = 0;
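// Pack the lower triangle column-major: column j contributes rows j..M-1,
// M * (M + 1) / 2 entries in total, matching the memalloc_ allocations
// above (e.g. for M = 3 the pos -> (i, j) order is (0,0), (1,0), (2,0),
// (1,1), (2,1), (2,2)).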
for (size_type j = 0; j < M_; ++j) {
for (size_type i = j; i < M_; ++i) {
variRefA_[pos] = A.coeffRef(i, j).vi_;
variRefL_[pos] = new vari(L_A.coeffRef(i, j), false);
++pos;
}
}
}

/**
* Reverse mode differentiation algorithm using a GPU
*
* Reference:
*
* Iain Murray: Differentiation of the Cholesky decomposition, 2016.
*
*/
virtual void chain() {
using Eigen::Block;
using Eigen::Lower;
using Eigen::MatrixXd;
using Eigen::StrictlyLower;
using Eigen::StrictlyUpper;
using Eigen::Upper;
MatrixXd Lbar(M_, M_);
MatrixXd L(M_, M_);
Lbar.setZero();
L.setZero();
size_t pos = 0;
for (size_type j = 0; j < M_; ++j) {
for (size_type i = j; i < M_; ++i) {
Lbar.coeffRef(i, j) = variRefL_[pos]->adj_;
L.coeffRef(i, j) = variRefL_[pos]->val_;
++pos;
}
}
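// Copy the packed factor and its adjoints onto the device for the blocked
// updates below.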
matrix L_gpu(L);
matrix Lbar_gpu(Lbar);
int M = M_;
int block_size_ = std::max((M / 8 / 16) * 16, 8);
block_size_ = std::min(block_size_, 512);
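// Tuned block sizes: a single block for M <= 256, then 256/352/320 for the
// mid ranges; for larger M the formula above caps the block size at 512.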
if (M <= 256) {
block_size_ = M;
} else if (M <= 1024) {
block_size_ = 256;
} else if (M <= 4096) {
block_size_ = 352;
} else if (M <= 8192) {
block_size_ = 320;
}
for (int k = M; k > 0; k -= block_size_) {
int j = std::max(0, k - block_size_);
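// Partition L (and Lbar) around columns [j, k): R = L[j:k, 0:j],
// D = L[j:k, j:k], B = L[k:M, 0:j], C = L[k:M, j:k].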
matrix R(k - j, j);
matrix D(k - j, k - j);
matrix Dinv(k - j, k - j);
matrix B(M - k, j);
matrix C(M - k, k - j);

matrix Rbar(k - j, j);
matrix Dbar(k - j, k - j);
matrix Dbar2(k - j, k - j);
matrix Bbar(M - k, j);
matrix Bbar2(M - k, j);
matrix Cbar(M - k, k - j);
matrix Cbar2(k - j, M - k);
matrix Cbar3(k - j, M - k);
matrix temp(k - j, j);

R.sub_block(L_gpu, j, 0, 0, 0, k - j, j);
D.sub_block(L_gpu, j, j, 0, 0, k - j, k - j);
B.sub_block(L_gpu, k, 0, 0, 0, M - k, j);
C.sub_block(L_gpu, k, j, 0, 0, M - k, k - j);

Rbar.sub_block(Lbar_gpu, j, 0, 0, 0, k - j, j);
Dbar.sub_block(Lbar_gpu, j, j, 0, 0, k - j, k - j);
Bbar.sub_block(Lbar_gpu, k, 0, 0, 0, M - k, j);
Cbar.sub_block(Lbar_gpu, k, j, 0, 0, M - k, k - j);

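// Cbar <- Cbar * D^-1, Bbar <- Bbar - Cbar * R, Dbar <- Dbar - Cbar^T * C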
if (Cbar.size() > 0) {
copy(Dinv, D);
Dinv = lower_triangular_inverse(Dinv);
Dinv = transpose(Dinv);
Cbar2 = transpose(Cbar);

Cbar3 = multiply(Dinv, Cbar2);
Cbar = transpose(Cbar3);

Bbar2 = multiply(Cbar, R);
Bbar = subtract(Bbar, Bbar2);

Cbar3 = transpose(Cbar);
Dbar2 = multiply(Cbar3, C);
Dbar = subtract(Dbar, Dbar2);
}

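// Dbar2 <- D^-T * copyltu(D^T * tril(Dbar)) * D^-1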
D = transpose(D);
Dbar.zeros<gpu::Upper>();
Dbar2 = multiply(D, Dbar);
Dbar2.triangular_transpose<gpu::LowerToUpper>();
D = transpose(D);
D = lower_triangular_inverse(D);
D = transpose(D);
Dbar = multiply(D, Dbar2);
Dbar = transpose(Dbar);
Dbar2 = multiply(D, Dbar);

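// Rbar <- Rbar - Cbar^T * B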
if (Cbar.size() > 0 && B.size() > 0) {
Cbar2 = transpose(Cbar);
temp = multiply(Cbar2, B);
Rbar = subtract(Rbar, temp);
}

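// Rbar <- Rbar - copyltu(Dbar2) * R (Dbar2 before the Phi step below)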
if (Dbar.size() > 0 && R.size() > 0) {
copy(Dbar, Dbar2);
Dbar.triangular_transpose<gpu::LowerToUpper>();
temp = multiply(Dbar, R);
Rbar = subtract(Rbar, temp);
}
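// Phi: halve the diagonal of Dbar2 and zero its strict upper triangle.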
Dbar2 = diagonal_multiply(Dbar2, 0.5);
Dbar2.zeros<gpu::Upper>();

Lbar_gpu.sub_block(Rbar, 0, 0, j, 0, k - j, j);
Lbar_gpu.sub_block(Dbar2, 0, 0, j, j, k - j, k - j);
Lbar_gpu.sub_block(Bbar, 0, 0, k, 0, M - k, j);
Lbar_gpu.sub_block(Cbar, 0, 0, k, j, M - k, k - j);
}
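// Copy the accumulated adjoints back to the host and add them into the
// adjoints of A below.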
copy(Lbar, Lbar_gpu);
pos = 0;
for (size_type j = 0; j < M_; ++j)
for (size_type i = j; i < M_; ++i)
variRefA_[pos++]->adj_ += Lbar.coeffRef(i, j);
}
};
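A sketch of what the per-block updates in chain() compute, as I read the code, in the notation of Murray (2016): copyltu(X) mirrors the lower triangle of X onto its upper triangle, tril(X) zeroes the strict upper triangle, and Phi(X) additionally halves the diagonal.

```latex
% Sketch of the blocked reverse-mode Cholesky updates in chain() above.
\begin{aligned}
\bar{C} &\leftarrow \bar{C}\, D^{-1} \\
\bar{B} &\leftarrow \bar{B} - \bar{C} R \\
\bar{D} &\leftarrow \bar{D} - \bar{C}^{\top} C \\
\bar{D} &\leftarrow D^{-\top}\, \mathrm{copyltu}\!\bigl(D^{\top}\, \mathrm{tril}(\bar{D})\bigr)\, D^{-1} \\
\bar{R} &\leftarrow \bar{R} - \bar{C}^{\top} B - \bar{D} R \\
\bar{D} &\leftarrow \Phi(\bar{D})
\end{aligned}
```

Note that the Rbar update uses the symmetric, pre-Phi Dbar; Phi is applied last, just before the blocks are written back into Lbar.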

/**
* Reverse mode specialization of cholesky decomposition
*
@@ -241,8 +414,17 @@ inline Eigen::Matrix<var, -1, -1> cholesky_decompose(
check_symmetric("cholesky_decompose", "A", A);
Member:

More on the debate about checks. For check_square I think leaving this here is fine, since performance-wise it does the same thing (access to the rows and cols members).

check_symmetric would be more performant if we did it with OpenCL, but that means we would call check_symmetric at line 390 and before 393, but on L_A rather than A, so an unnecessary value_of_rec call happens if it is actually not symmetric.

Performance-wise this is nothing to worry about, so I am fine leaving this as is for the sake of cleaner code.

Collaborator (author):

Yeah, the call to value_of_rec is also very expensive. I'll look at this later to see if there is a more clever way to do it, but I'm fine with the CPU version.

Collaborator (author):

Also a good example of where Sean's previous PR, which saves flags for when these checks pass on the data, would solve this without having to do any GPU stuff.


Eigen::Matrix<double, -1, -1> L_A(value_of_rec(A));
#ifdef STAN_OPENCL
if (L_A.rows() > 1000) {
L_A = cholesky_decompose(L_A);
check_pos_definite("cholesky_decompose", "m", L_A);
} else {
#endif
Eigen::LLT<Eigen::Ref<Eigen::MatrixXd>, Eigen::Lower> L_factor(L_A);
check_pos_definite("cholesky_decompose", "m", L_factor);
#ifdef STAN_OPENCL
}
#endif

// Memory allocated in arena.
// cholesky_scalar gradient faster for small matrices compared to
@@ -264,7 +446,11 @@
accum += j;
accum_i = accum;
}
#ifdef STAN_OPENCL
} else if (L_A.rows() < 1000) {
#else
} else {
#endif
cholesky_block* baseVari = new cholesky_block(A, L_A);
size_t pos = 0;
for (size_type j = 0; j < L.cols(); ++j) {
@@ -274,7 +460,21 @@
for (size_type k = 0; k < j; ++k)
L.coeffRef(k, j).vi_ = dummy;
}
#ifdef STAN_OPENCL
} else {
cholesky_gpu* baseVari = new cholesky_gpu(A, L_A);
size_t pos = 0;
for (size_type j = 0; j < L.cols(); ++j) {
for (size_type i = j; i < L.cols(); ++i) {
L.coeffRef(i, j).vi_ = baseVari->variRefL_[pos++];
}
for (size_type k = 0; k < j; ++k)
L.coeffRef(k, j).vi_ = dummy;
}
}
#else
}
#endif
return L;
}
} // namespace math
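For orientation, here is a minimal standalone sketch of the three-way dispatch that the function above implements. choose_impl and scalar_cutoff are hypothetical names: the scalar/block boundary sits in a collapsed hunk of the diff, so it is left as a parameter here, while the 1000-row OpenCL threshold is taken from the diff itself.

```cpp
#include <cstddef>

// Sketch only: choose_impl and scalar_cutoff are not part of Stan Math.
enum class cholesky_impl { scalar, block, gpu };

inline cholesky_impl choose_impl(std::size_t rows, std::size_t scalar_cutoff,
                                 bool opencl_enabled) {
  if (rows <= scalar_cutoff)
    return cholesky_impl::scalar;  // cholesky_scalar: per-element varis
  if (opencl_enabled && rows >= 1000)
    return cholesky_impl::gpu;     // cholesky_gpu: blocked OpenCL chain()
  return cholesky_impl::block;     // cholesky_block: blocked CPU chain()
}
```

With STAN_OPENCL defined, a 1500x1500 matrix would take the gpu path, while the same matrix built without OpenCL support falls back to the blocked CPU vari.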