Add GPU Cholesky Primitive #1059

Merged 77 commits on Feb 12, 2019.
Changes shown from 14 commits.
12e5d8e
revised cholesky prim
rok-cesnovar Oct 28, 2018
ceee137
added comments & minor stuff
rok-cesnovar Oct 28, 2018
7528d50
inverse fixes and added function to /prim
rok-cesnovar Oct 29, 2018
8284c14
removed files
rok-cesnovar Oct 29, 2018
834ff4a
now passing
rok-cesnovar Oct 29, 2018
4c87e21
Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim
SteveBronder Oct 30, 2018
27beacc
Adds to docs, cleans up some code, use auto and const where possible
SteveBronder Oct 30, 2018
87a67df
Fixes docs and changes name of test against cpu for fixed matrix
SteveBronder Oct 30, 2018
be110d2
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Oct 30, 2018
287ae57
forgot to include algorithm, we need to fix that lint check
SteveBronder Oct 30, 2018
898ebca
Merge branch 'gpu_cholesky_prim' of https://github.com/bstatcomp/math…
SteveBronder Oct 30, 2018
4fdb965
include algorithm again
rok-cesnovar Oct 30, 2018
36bea8b
remove auto on return type
rok-cesnovar Oct 30, 2018
c126ceb
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Oct 30, 2018
6c0daff
move check for square and symmetric to top of cholesky decompose prim
SteveBronder Nov 9, 2018
2d00cdf
[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…
stan-buildbot Nov 9, 2018
e8f9540
Move STAN_OPENCL check to above includes for both tests
SteveBronder Nov 10, 2018
6b57d85
Remove unneded comments, removes the call to the zero kernel and fill…
SteveBronder Nov 15, 2018
1a4a17b
fixed the zeroing in the kernel
rok-cesnovar Nov 15, 2018
36a6d4a
moved the recursion or kernel step to a separate function
rok-cesnovar Nov 15, 2018
10c4f59
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Nov 15, 2018
a781a96
Merge branch 'develop' into gpu_cholesky_prim
rok-cesnovar Nov 16, 2018
88c26a0
using operators for * and -
rok-cesnovar Nov 16, 2018
cc187e6
added tests for edge cases of lower_tri_inverse
rok-cesnovar Nov 16, 2018
0115c85
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Nov 16, 2018
6bea215
Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim
SteveBronder Nov 17, 2018
b3ef66d
Merge branch 'gpu_cholesky_prim' of https://github.com/bstatcomp/math…
SteveBronder Nov 17, 2018
1c9bf89
Adds size checks to the cholesky and removes the part in the docs abo…
SteveBronder Nov 18, 2018
763aca2
merge to dev
SteveBronder Dec 26, 2018
021d87d
merge to dev
SteveBronder Dec 26, 2018
0422c04
adds floor and uses half the rows of m in chol for the starting block…
SteveBronder Dec 29, 2018
c120223
Merge pull request #8 from bstatcomp/gpu-chol-prim-floor
SteveBronder Dec 29, 2018
5bc8b71
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Dec 29, 2018
774aef5
merge to develop
SteveBronder Jan 4, 2019
58666aa
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 4, 2019
da52c49
Remove extra ifdef in prim
SteveBronder Jan 23, 2019
dc14da2
Remove extra ifdef in prim
SteveBronder Jan 23, 2019
a736014
Removes tuning parameters from gpu cholesky and recursion
SteveBronder Jan 24, 2019
6649984
Adds argument in cholesky_gpu for the min block size
SteveBronder Jan 24, 2019
7d68799
Merge commit 'fb2cc51188b0171d70d63d5ef8be44998f5b3814' into HEAD
yashikno Jan 24, 2019
f557e47
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 24, 2019
d71e4a6
replace | with || in if for cholesky recursion
SteveBronder Jan 24, 2019
d9675c5
Fix docs
SteveBronder Jan 24, 2019
775f7e4
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 24, 2019
0be4541
Restructure GPU cholesky to get rid of the explicit recursion functio…
SteveBronder Jan 26, 2019
1e90855
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 26, 2019
d461d06
1. Places recursive cholesky in the internal namespace. This
SteveBronder Jan 27, 2019
bb34aec
Merge branch 'gpu_cholesky_prim' of https://github.com/bstatcomp/math…
SteveBronder Jan 27, 2019
d7d09e7
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 27, 2019
d3bd266
Fix ifdef in chol prim
SteveBronder Jan 27, 2019
1b65f2a
Fix docs to match code
SteveBronder Jan 28, 2019
558f4f7
Changes the min_block logic to be a bit more clever. Catch local size…
SteveBronder Jan 29, 2019
85eb0f3
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 29, 2019
907f1ec
Minor comment cleanups, remove excessive auto, use max_thread_block_…
SteveBronder Jan 30, 2019
dc24e9f
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Jan 30, 2019
2f8be5b
Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim
SteveBronder Jan 31, 2019
89ebee8
Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim
SteveBronder Feb 1, 2019
7a146e4
Merge remote-tracking branch 'upstream/develop' into gpu_cholesky_prim
SteveBronder Feb 3, 2019
fbb97a3
Update so that Cholesky pulls the required size to run the Chol kerne…
SteveBronder Feb 3, 2019
f594437
Make a catch for old devices with small max thread block size for cho…
SteveBronder Feb 3, 2019
6c26574
[Jenkins] auto-formatting by clang-format version 5.0.2-svn328729-1~e…
stan-buildbot Feb 3, 2019
d2285e0
Hard coded values for Cholesky tuning are now moved to the opencl con…
SteveBronder Feb 3, 2019
8b04db9
Update Opencl context
SteveBronder Feb 3, 2019
6bc3eba
[Jenkins] auto-formatting by clang-format version 5.0.2-svn328729-1~e…
stan-buildbot Feb 3, 2019
749e21c
update context and cholesky docs
SteveBronder Feb 3, 2019
1d8a707
Merge branch 'gpu_cholesky_prim' of https://github.com/bstatcomp/math…
SteveBronder Feb 3, 2019
89d1c21
removed zeroing, changed to >= in the size check
rok-cesnovar Feb 4, 2019
9215ebc
Fix names for opencl cholesky and add tests for tuning parameters
SteveBronder Feb 9, 2019
a0b4597
Merge commit '70edefd8b0f009e3d657f9488fe4f1cedb823ba1' into HEAD
yashikno Feb 9, 2019
dcc8c62
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 9, 2019
32dd8ca
forgot to include vector in opencl chol test
SteveBronder Feb 9, 2019
f611fc7
forgot to remove doc for opencl_context.tuning_opts()
SteveBronder Feb 9, 2019
15fbbe1
prim was not using the return value
rok-cesnovar Feb 9, 2019
b704bce
added zeroing for multiply(Nx0,0xM)
rok-cesnovar Feb 10, 2019
08b19f0
[Jenkins] auto-formatting by clang-format version 5.0.0-3~16.04.1 (ta…
stan-buildbot Feb 10, 2019
a3f8881
Reduce and broaden chol test
seantalts Feb 11, 2019
96118c8
Fix runTests.py argument parsing
seantalts Feb 11, 2019
132 changes: 132 additions & 0 deletions stan/math/gpu/cholesky_decompose.hpp
@@ -0,0 +1,132 @@
#ifndef STAN_MATH_GPU_CHOLESKY_DECOMPOSE_HPP
#define STAN_MATH_GPU_CHOLESKY_DECOMPOSE_HPP
#ifdef STAN_OPENCL
#include <stan/math/gpu/matrix_gpu.hpp>
#include <stan/math/gpu/kernels/cholesky_decompose.hpp>
#include <stan/math/gpu/multiply.hpp>
#include <stan/math/gpu/multiply_transpose.hpp>
#include <stan/math/gpu/lower_tri_inverse.hpp>
#include <stan/math/gpu/transpose.hpp>
#include <stan/math/gpu/subtract.hpp>
#include <stan/math/gpu/err/check_diagonal_zeros.hpp>
#include <stan/math/gpu/err/check_nan.hpp>
#include <CL/cl.hpp>
#include <algorithm>
namespace stan {
namespace math {
/**
* Return the lower-triangular Cholesky factor (i.e., matrix
* square root) of the specified square, symmetric matrix.
* The return value \f$L\f$ will be a lower-triangular matrix such that the
* original matrix \f$A\f$ is given by
* <p>\f$A = L \times L^T\f$.
* The Cholesky decomposition is computed on the GPU. The algorithm is
* recursive: the parameters <code>block</code>, <code>divider</code>, and
* <code>min_block</code> act as tuning parameters for the recursive step of
* the GPU-based Cholesky decomposition. The matrix is subset by the
* <code>block</code> size, and if the <code>block</code> size is less than
* <code>min_block</code> then the Cholesky decomposition on the GPU is computed
* using that submatrix. If <code>block</code> is greater than
Review comment (Member): we don't have <code>block</code> anymore :/
* <code>block_size</code> then <code>cholesky_decompose</code> is run again
* with <code>block</code> equal to <code>block/divider</code>. Once the
* Cholesky decomposition of each block is computed, the full-matrix Cholesky
* factor is created by propagating the factor forward as given in the
* reference report below.
*
* For a full guide to how this works,
* see the Cholesky decomposition chapter in the reference report
* <a href="https://goo.gl/6kWkJ5">here</a>.
* @param A Symmetric matrix on the GPU.
* @param block Size of the block used to compute the Cholesky decomposition.
* @param divider Proportion to divide the submatrix by at each recursive step.
* @param min_block The amount that block is checked against to decide
* whether to continue the recursion or to perform the Cholesky decomposition.
* @return Lower-triangular Cholesky factor (matrix square root) on the GPU.
* @throw std::domain_error if m is not
* positive definite (if m has more than 0 elements)
*/
inline matrix_gpu cholesky_decompose(matrix_gpu& A, const int block = 100,
const int divider = 2,
const int min_block = 100) {
auto offset = 0;
// NOTE: The code in this section follows the naming conventions
// in the report linked in the docs.
matrix_gpu A_11(block, block);
matrix_gpu L_11(block, block);
// Repeats the blocked cholesky decomposition until the size of the remaining
// submatrix is smaller or equal to the block size
Review comment (Member): Is this comment still accurate? It seems like this while loop is not actually breaking down A into smaller subproblems but rather iterating through all of the blocks of size block in A, right?

Review comment (Collaborator, Author): I think the comment is accurate, but worded weirdly. I'll clean this up.

Review comment (Collaborator, Author): Actually, reading that comment, it's better to change the comment on line 108 so that it references the remaining submatrix. Then readers can see that this part runs until the block size exceeds the size of the matrix, and the remaining part of the matrix is computed below.

while ((offset + block) < (A.rows())) {
auto block_subset = A.rows() - offset - block;
matrix_gpu A_21(block_subset, block);
matrix_gpu A_22(block_subset, block_subset);
// Copies the A_11 submatrix from the input
A_11.sub_block(A, offset, offset, 0, 0, block, block);
// Calls the blocked cholesky for the submatrix A_11
// or calls the kernel directly if the size of the block is small enough
if (block <= min_block || divider <= 1) {
L_11.zeros();
try {
opencl_kernels::cholesky_decompose(
cl::NDRange(A_11.rows()), cl::NDRange(A_11.rows()), A_11.buffer(),
L_11.buffer(), A_11.rows());
} catch (const cl::Error& e) {
check_opencl_error("cholesky_decompose", e);
}
} else {
L_11 = cholesky_decompose(A_11, block / divider, divider, min_block);
}
// Copies the cholesky factor of A_11 back to the input matrix
A.sub_block(L_11, 0, 0, offset, offset, block, block);
// Copies the A_21 submatrix from the input matrix,
auto block_offset = offset + block;
A_21.sub_block(A, block_offset, offset, 0, 0, block_subset, block);
// computes A_21*((L_11^-1)^T)
// and copies the resulting submatrix to the input matrix
matrix_gpu A_11_inverse = lower_triangular_inverse(L_11);
A_11_inverse = transpose(A_11_inverse);
// TODO(Steve): Replace with mult operator when that PR goes through
matrix_gpu L_21 = multiply(A_21, A_11_inverse);
A.sub_block(L_21, 0, 0, block_offset, offset, block_subset, block);
// Copies the A_22 submatrix from the input matrix,
A_22.sub_block(A, block_offset, block_offset, 0, 0, block_subset,
block_subset);
// computes A_22 - L_21*(L_21^T)
// and copies the resulting submatrix back to the input matrix
matrix_gpu temp = multiply_transpose(L_21);
// TODO(Steve): Replace with subtraction operator when that PR goes through
Review comment (Member): Steve, it may also be cool to code up a way to assign directly to a sub-block. We should probably attempt to match Eigen on something like that, though perhaps that's a large ask as their API looks fairly complicated and we may not even want that complication. But somehow assigning directly to a sub-block would really slick up this API.

Review comment (Collaborator, Author): I agree that would be rad! I've thought about this in passing but haven't tried anything out yet. For lvalues I don't think it would be too bad. Off the top of my head: something like a templated struct in the matrix_gpu class called block that overloads the assignment operator and internally calls sub_block; then we make sub_block private and only expose the block API. Have to think about rvalues, something about overloading ().

Review comment (Member): Can you say a little more about the rvalue problem?

Review comment (Collaborator, Author — SteveBronder, Nov 17, 2018): ¯\_(ツ)_/¯ Not really a problem, just that I haven't tried anything or thought about it.

matrix_gpu L_22 = subtract(A_22, temp);
A.sub_block(L_22, 0, 0, block_offset, block_offset, block_subset,
block_subset);
offset += block;
}
// Computes the Cholesky factor for the remaining part of the matrix
const auto remaining_rows = A.rows() - offset;
if (remaining_rows > 0) {
matrix_gpu A_11(remaining_rows, remaining_rows);
matrix_gpu L_11(remaining_rows, remaining_rows);
A_11.sub_block(A, offset, offset, 0, 0, remaining_rows, remaining_rows);
// Calls the blocked cholesky for the submatrix A_11
// or calls the kernel directly if the size of the block is small enough
if (block <= min_block || divider <= 1) {
L_11.zeros();
try {
opencl_kernels::cholesky_decompose(
cl::NDRange(A_11.rows()), cl::NDRange(A_11.rows()), A_11.buffer(),
L_11.buffer(), A_11.rows());
} catch (const cl::Error& e) {
check_opencl_error("cholesky_decompose", e);
}
} else {
L_11 = cholesky_decompose(A_11, block / divider);
}
A.sub_block(L_11, 0, 0, offset, offset, remaining_rows, remaining_rows);
}
check_nan("cholesky_decompose_gpu", "Matrix m", A);
check_diagonal_zeros("cholesky_decompose_gpu", "Matrix m", A);
A.zeros<stan::math::TriangularViewGPU::Upper>();
return A;
}
} // namespace math
} // namespace stan

#endif
#endif
74 changes: 74 additions & 0 deletions stan/math/gpu/kernels/cholesky_decompose.hpp
@@ -0,0 +1,74 @@
#ifndef STAN_MATH_GPU_KERNELS_CHOLESKY_DECOMPOSE_HPP
#define STAN_MATH_GPU_KERNELS_CHOLESKY_DECOMPOSE_HPP
#ifdef STAN_OPENCL

#include <stan/math/gpu/kernel_cl.hpp>

namespace stan {
namespace math {
namespace opencl_kernels {
// \cond
const char *cholesky_decompose_kernel_code = STRINGIFY(
// \endcond
/**
* Calculates the Cholesky Decomposition of a matrix on a GPU
*
* This kernel is run with threads organized in one dimension and
* in a single thread block. The kernel is best suited for
* small input matrices as it only utilizes a single streaming
* multiprocessor. This kernel is used as part of a blocked
* Cholesky decomposition.
*
* @param[in] A The input matrix
* @param[in, out] B The result of the Cholesky decomposition of A.
* It must be set to zeros before execution of this kernel.
* @param rows The number of rows for A and B.
* @note Code is a <code>const char*</code> held in
* <code>cholesky_decompose_kernel_code</code>.
* Used in math/gpu/cholesky_decompose.hpp.
* This kernel uses the helper macros available in helpers.cl.
*
*/
__kernel void cholesky_decompose(__global double *A, __global double *B,
int rows) {
int local_index = get_local_id(0);
// The following code is the sequential version of the non-in-place
// Cholesky decomposition. Only the innermost loops are parallelized;
// the rows are processed sequentially. This loop processes all the rows:
for (int j = 0; j < rows; j++) {
// First thread calculates the diagonal element
if (local_index == 0) {
double sum = 0;
for (int k = 0; k < j; k++) {
sum += B(j, k) * B(j, k);
}
B(j, j) = sqrt(A(j, j) - sum);
}
barrier(CLK_LOCAL_MEM_FENCE);
// Calculates the rest of the row
if (local_index >= (j + 1) && local_index < rows) {
double inner_sum = 0;
for (int k = 0; k < j; k++) {
inner_sum += B(local_index, k) * B(j, k);
}
B(local_index, j) = (1.0 / B(j, j) * (A(local_index, j) - inner_sum));
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
// \cond
);
// \endcond

/**
* See the docs for \link kernels/cholesky_decompose.hpp cholesky_decompose()
* \endlink
*/
const local_range_kernel<cl::Buffer, cl::Buffer, int> cholesky_decompose(
"cholesky_decompose", cholesky_decompose_kernel_code);

} // namespace opencl_kernels
} // namespace math
} // namespace stan
#endif
#endif
11 changes: 8 additions & 3 deletions stan/math/gpu/kernels/inv_lower_tri_multiply.hpp
@@ -77,14 +77,19 @@ const char* inv_lower_tri_multiply_kernel_code = STRINGIFY(
const int local_col = thread_block_col + w * THREAD_BLOCK_SIZE_COL;
const int local_row = thread_block_row;
// Element above the diagonal will not be transferred.
if (C2_global_col <= C2_global_row) {
if (C2_global_col <= C2_global_row && C2_global_col < A_rows
&& C2_global_row < A_rows) {
C2_local[local_col][local_row]
= A[C2_global_col * A_rows + C2_global_row];
} else {
C2_local[local_col][local_row] = 0;
}
A3_local[local_col][local_row]
= A[A3_global_col * A_rows + A3_global_row];
if (A3_global_col < A_rows && A3_global_row < A_rows) {
A3_local[local_col][local_row]
= A[A3_global_col * A_rows + A3_global_row];
} else {
A3_local[local_col][local_row] = 0.0;
}
}
// Wait until all tile values are loaded to the local memory
barrier(CLK_LOCAL_MEM_FENCE);
7 changes: 5 additions & 2 deletions stan/math/gpu/kernels/neg_rect_lower_tri_multiply.hpp
@@ -73,7 +73,8 @@ const char* neg_rect_lower_tri_multiply_kernel_code = STRINGIFY(
temp_local[local_col][local_row] = 0.0;
}
// Element above the diagonal will not be transferred.
if (C1_global_col <= C1_global_row) {
if (C1_global_col <= C1_global_row && C1_global_col < A_rows
&& C1_global_row < A_rows) {
C1_local[local_col][local_row]
= A[C1_global_col * A_rows + C1_global_row];
} else {
@@ -102,7 +103,9 @@ const char* neg_rect_lower_tri_multiply_kernel_code = STRINGIFY(
for (int w = 0; w < WORK_PER_THREAD; w++) {
const int A_global_col
= A_global_col_offset + w * THREAD_BLOCK_SIZE_COL;
A[A_global_col * A_rows + i + rows + offset] = -acc[w];
if (A_global_col < A_rows && (i + rows + offset) < A_rows) {
A[A_global_col * A_rows + i + rows + offset] = -acc[w];
}
}
}
// \cond
12 changes: 12 additions & 0 deletions stan/math/prim/mat/fun/cholesky_decompose.hpp
@@ -5,6 +5,10 @@
#include <stan/math/prim/mat/err/check_pos_definite.hpp>
#include <stan/math/prim/mat/err/check_square.hpp>
#include <stan/math/prim/mat/err/check_symmetric.hpp>
#ifdef STAN_OPENCL
#include <stan/math/gpu/cholesky_decompose.hpp>
#include <algorithm>
Review comment (Member): Why do you need to include this here? Is this for copy? Should probably include algorithm in our copy header and use that instead, right?

Review comment (Collaborator, Author — SteveBronder, Nov 11, 2018): Yeah, I'm pretty sure this is because cpplint complains when it sees copy without the algorithm header (even if the call is stan::math::copy()). Can we change that in cpplint?

Review comment (Member): Yes, cpplint complained with the error message to add <algorithm>.

Review comment (Member): boo.

Review comment (Collaborator, Author): @seantalts is this cool? I can look later to see if there is a way to get that check for copy to only happen if it's not from stan::math.

Review comment (Member): I think it might be slightly better to try // NOLINT-ing it (because then the programmer reading this code isn't thinking we're using code from <algorithm>), but I don't feel too strongly either way.
#endif

namespace stan {
namespace math {
@@ -23,12 +27,20 @@ namespace math {
template <typename T>
Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> cholesky_decompose(
const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>& m) {
#ifdef STAN_OPENCL
matrix_gpu m_gpu(m);
Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> m_chol(m.rows(), m.cols());
cholesky_decompose(m_gpu);
copy(m_chol, m_gpu);
return m_chol;
#else
check_square("cholesky_decompose", "m", m);
check_symmetric("cholesky_decompose", "m", m);
Eigen::LLT<Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> > llt(m.rows());
llt.compute(m);
check_pos_definite("cholesky_decompose", "m", llt);
return llt.matrixL();
#endif
}

} // namespace math
79 changes: 79 additions & 0 deletions test/unit/math/gpu/cholesky_decompose_test.cpp
@@ -0,0 +1,79 @@
#ifdef STAN_OPENCL
#include <stan/math/prim/mat.hpp>
#include <stan/math/gpu/copy.hpp>
#include <stan/math/gpu/cholesky_decompose.hpp>
#include <stan/math/prim/mat/fun/cholesky_decompose.hpp>
#include <stan/math/prim/mat/fun/Eigen.hpp>
#include <stan/math/prim/mat/err/check_pos_definite.hpp>
#include <stan/math/prim/mat/err/check_square.hpp>
#include <stan/math/prim/mat/err/check_symmetric.hpp>
#include <gtest/gtest.h>
#include <algorithm>
#define EXPECT_MATRIX_NEAR(A, B, DELTA) \
for (int i = 0; i < A.size(); i++) \
EXPECT_NEAR(A(i), B(i), DELTA);

TEST(MathMatrix, cholesky_decompose_cpu_vs_gpu_small) {
stan::math::matrix_d m0(3, 3);
m0 << 25, 15, -5, 15, 18, 0, -5, 0, 11;

stan::math::matrix_d m1(4, 4);
m1 << 18, 22, 54, 42, 22, 70, 86, 62, 54, 86, 174, 134, 42, 62, 134, 106;

stan::math::matrix_gpu m0_gpu(m0);
stan::math::matrix_gpu m1_gpu(m1);

stan::math::matrix_d m0_res = stan::math::cholesky_decompose(m0);
stan::math::matrix_d m1_res = stan::math::cholesky_decompose(m1);

stan::math::matrix_gpu m0_chol_gpu = stan::math::cholesky_decompose(m0_gpu);
stan::math::matrix_gpu m1_chol_gpu = stan::math::cholesky_decompose(m1_gpu);

stan::math::copy(m0, m0_chol_gpu);
stan::math::copy(m1, m1_chol_gpu);

EXPECT_MATRIX_NEAR(m0, m0_res, 1e-8);
EXPECT_MATRIX_NEAR(m1, m1_res, 1e-8);
}

void cholesky_decompose_test(int size) {
stan::math::matrix_d m1 = stan::math::matrix_d::Random(size, size);
stan::math::matrix_d m1_pos_def
= m1 * m1.transpose() + size * Eigen::MatrixXd::Identity(size, size);

stan::math::matrix_d m1_cpu(size, size);
stan::math::matrix_d m1_cl(size, size);

stan::math::check_square("cholesky_decompose", "m", m1_pos_def);
stan::math::check_symmetric("cholesky_decompose", "m", m1_pos_def);
Eigen::LLT<stan::math::matrix_d> llt(m1_pos_def.rows());
llt.compute(m1_pos_def);
stan::math::check_pos_definite("cholesky_decompose", "m", llt);
m1_cpu = llt.matrixL();

m1_cl = stan::math::cholesky_decompose(m1_pos_def);

double max_error = 0;
for (int i = 0; i < size; i++) {
for (int j = 0; j <= i; j++) {
double abs_err = std::fabs(m1_cpu(i, j) - m1_cl(i, j));
double a = std::max(abs_err / m1_cpu(i, j), abs_err / m1_cl(i, j));
max_error = std::max(max_error, a);
}
}
EXPECT_LT(max_error, 1e-8);
}

TEST(MathMatrix, cholesky_decompose_small) {
cholesky_decompose_test(10);
cholesky_decompose_test(50);
cholesky_decompose_test(100);
}

TEST(MathMatrix, cholesky_decompose_big) {
cholesky_decompose_test(500);
cholesky_decompose_test(1000);
cholesky_decompose_test(2000);
}

#endif
Review comment (Member): If we end up supporting the block, divider, and min_block arguments to cholesky_decompose, can we add a couple of test cases showing how they should work?