[WIP] [WIP] [WIP] a way of kerneling #7
Conversation
I started messing with this and like how it looks to users, and the more I look at it the more sense it makes. I tried putting variadic templating around the operator and make_functor and got it working, but I couldn't figure out the templating for the row and column range that we put into the NDRange:

```cpp
template <typename... Args>
auto make_functor() const {
  return cl::make_kernel<Args...>(kernel_cl::compile(name_, source_));
}

// I couldn't figure out a better place to put row_range and col_range
template <typename... Args>
void operator()(int row_range, int col_range, Args&&... args) const {
  auto f = make_functor<Args...>();
  f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(row_range, col_range)),
    args...).wait();
}
```
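For illustration only, a sketch of what a call site might look like with this wrapper; the class name `gpu_kernel`, the kernel name `"add"`, and `add_kernel_code` are assumed here, not from this PR:

```cpp
// hypothetical wrapper instance built from a kernel name and a source string
gpu_kernel add("add", add_kernel_code);

// launch over a rows x cols NDRange; the remaining arguments are forwarded
// to the kernel. Deduction through Args&&... yields reference types, so
// spelling out the template arguments is the safer call here.
add.operator()<cl::Buffer, cl::Buffer, cl::Buffer, int, int>(
    rows, cols, a, b, c, rows, cols);
```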
Sweet. I also played around with different variadic template argument ideas, but it's a delicate balance between 1) ease for kernel developers, 2) ease for normal math library users, and 3) configuration for power users or whoever will come in and set their own. In some sense what you've got above is an abstraction at a slightly higher level than `cl::make_kernel`.
I played around with it more and started adding another kernel, and I see why it's attractive to try to abstract more. Here are three options, from most to least shared code.

Option 1 (shared base class that also abstracts `operator()`):

```cpp
template <typename... FnArgs>
class basic_kernel {
 private:
  const char* name_;
  const char* source_;

 public:
  basic_kernel(const char* name, const char* source)
      : name_(name), source_(source) {}
  auto make_functor() const {
    return cl::make_kernel<FnArgs...>(kernel_cl::compile(name_, source_));
  }
  template <typename... OpArgs>
  void operator()(int rows, int cols, OpArgs... args) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      args...).wait();
  }
};

class unary_kernel : public basic_kernel<cl::Buffer, cl::Buffer, int, int> {
 public:
  unary_kernel(const char* name, const char* source)
      : basic_kernel(name, source) {}
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    basic_kernel::operator()<cl::Buffer, cl::Buffer, int, int>(
        rows, cols, src, dest, rows, cols);
  }
};

class self_modifying_kernel : public basic_kernel<cl::Buffer, int, int> {
 public:
  self_modifying_kernel(const char* name, const char* source)
      : basic_kernel(name, source) {}
  void operator()(cl::Buffer src, int rows, int cols) const {
    basic_kernel::operator()<cl::Buffer, int, int>(
        rows, cols, src, rows, cols);
  }
};
```

Option 2 (shared base class, but each kernel writes its own `operator()`):

```cpp
class unary_kernel2 : public basic_kernel<cl::Buffer, cl::Buffer, int, int> {
 public:
  unary_kernel2(const char* name, const char* source)
      : basic_kernel(name, source) {}
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, dest, rows, cols).wait();
  }
};

class self_modifying_kernel2 : public basic_kernel<cl::Buffer, int, int> {
 public:
  self_modifying_kernel2(const char* name, const char* source)
      : basic_kernel(name, source) {}
  void operator()(cl::Buffer src, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, rows, cols).wait();
  }
};
```

Option 3 (no shared code at all):

```cpp
class unary_kernel3 {
 private:
  const char* name_;
  const char* source_;

 public:
  unary_kernel3(const char* name, const char* source)
      : name_(name), source_(source) {}
  auto make_functor() const {
    return cl::make_kernel<cl::Buffer, cl::Buffer, int, int>(
        kernel_cl::compile(name_, source_));
  }
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      src, dest, rows, cols).wait();
  }
};

class self_modifying_kernel3 {
 private:
  const char* name_;
  const char* source_;

 public:
  self_modifying_kernel3(const char* name, const char* source)
      : name_(name), source_(source) {}
  auto make_functor() const {
    return cl::make_kernel<cl::Buffer, int, int>(
        kernel_cl::compile(name_, source_));
  }
  void operator()(cl::Buffer src, int rows, int cols) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      src, rows, cols).wait();
  }
};
```

I think 1 is very confusing; I'm leaning towards 2 or 3 (not trying to abstract `operator()`).
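Whichever option wins, the call sites for options 2 and 3 look identical. A quick sketch, assuming a hypothetical transpose kernel source string `transpose_kernel_code`:

```cpp
// unary_kernel2 and unary_kernel3 expose the same operator() shape
unary_kernel2 transpose("transpose", transpose_kernel_code);
// src and dst are cl::Buffer objects already allocated on the device
transpose(src, dst, rows, cols);
```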
Oh man, I was trying to figure out how to explain all the template options I'd tried, and then I realized the functor itself can just be a member:

```cpp
template <typename... Args>
class kernel_functor {
 private:
  const char* name_;
  const char* source_;

 public:
  kernel_functor(const char* name, const char* source)
      : name_(name), source_(source) {}
  auto operator()() const {
    return cl::make_kernel<Args...>(kernel_cl::compile(name_, source_));
  }
};

struct unary_kernel4 {
  kernel_functor<cl::Buffer, cl::Buffer, int, int> make_functor;
  unary_kernel4(const char* name, const char* source)
      : make_functor(name, source) {}
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, dest, rows, cols).wait();
  }
};
```

Now `make_functor` is a data member rather than a base-class method, and the ugliness in trying to abstract `operator()` goes away.
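As a usage sketch (the `"copy"` name and `copy_kernel_code` string are assumed for illustration), the composed version reads the same at the call site while the class body stays free of template noise:

```cpp
// hypothetical kernel instance; compiles, enqueues over rows x cols, waits
unary_kernel4 copy("copy", copy_kernel_code);
copy(src, dst, rows, cols);
```

Composition over inheritance here also means nothing outside the member declaration needs to know the functor's template arguments.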
I think if we wanted to do a more abstract thing (maybe we do?) we'd want to rewrite all of the applicable kernels so that the parameters we need to make the NDRange (most often, or maybe always, the rows and cols) come first (last wouldn't work with parameter packs). Then it'd be pretty easy... Maybe we should do that?
These really look nice! Thanks, Sean, for the effort, really appreciated!! If rewriting would make things easier I am all for it. We will mostly have the rows & cols NDRange parameters, but we do sometimes have 1D or 3D ranges. What if the first parameter was simply a cl::NDRange? We could also have 2 cl::NDRange parameters for kernels where we want to specify the thread block size (work-group size); see the sketch below.
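A minimal sketch of what that suggestion might look like; nothing here is settled, and the struct name `ndrange_kernel` is made up:

```cpp
template <typename... Args>
struct ndrange_kernel {
  const char* name_;
  const char* source_;
  ndrange_kernel(const char* name, const char* source)
      : name_(name), source_(source) {}
  auto make_functor() const {
    return cl::make_kernel<Args...>(kernel_cl::compile(name_, source_));
  }
  // global range only: the OpenCL runtime picks the work-group size
  void operator()(cl::NDRange global, Args... args) const {
    make_functor()(cl::EnqueueArgs(opencl_context.queue(), global), args...)
        .wait();
  }
  // global + local: the caller pins the work-group (thread block) size
  void operator()(cl::NDRange global, cl::NDRange local, Args... args) const {
    make_functor()(
        cl::EnqueueArgs(opencl_context.queue(), global, local), args...)
        .wait();
  }
};
```

This covers 1D, 2D, and 3D launches uniformly, since cl::NDRange has one-, two-, and three-argument constructors.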
So far I like it! Agree with Rok on having the first arg be NDRange.
Hm, that could work. I want to write out a little bit of stuff I learned by playing with it all, and then I have a question about the NDRange arg.

I see that we're creating a non-default queue (and context?). Maybe we can find a way to set those to default and then not have to access them from the `opencl_context` singleton.

Re: the NDRange first arg, it is bothering me slightly, in a code-golf perfectionist way, that someone would call add like this:

```cpp
opencl_kernel::add(cl::NDRange(rows, cols), A, B, C, rows, cols);
```

instead of

```cpp
opencl_kernel::add(rows, cols, A, B, C);
```
I put up a partial refactoring towards the 2nd version. It's not so bad. What do you guys think of these two options? Should we just go with something? Sorry for spending so much time messing around with this interface.
I went through the OpenCL C++ API docs and I see a mention of the default context, but I don't see a lot of info about it. Can you show me where you are seeing that? I'm personally more in favor of the first option, with cl::NDRange as the first argument.
Also, totally no worries, I really appreciate the effort! It's awesome to have your input on the design.
Hm, I'm confused about something in the doc.
PS: if you guys prefer the NDRange version I'm happy with that too. I'm a bit on the fence and was just going with the 2nd because it's slightly less typing, but maybe it's also more confusing?
There is an error in the doc :(, I apologize for that, and thanks for finding it. I think this is some copy-paste that was never fixed :/ I would maybe also prefer the 1st option, as we would be able to use it for kernels like the multiply_diagonal_with_scalar one (which is coming up) that spawns A.rows() threads where each thread multiplies one diagonal element. We could simply use `(NDRange(A.rows()), A, rows, cols)`. Also, for the GPU kernels like the cov_exp_quad one that spawn x.size() times x.size() threads, we would also be able to reuse it with `(NDRange(x.size(), x.size()), x, x.size())`.
Sounds good!

(phoning it in ;))
I'm not sure about whether or not we'd want to use the default context... but I saw here that we could use the default context and queue in a bunch of places, avoid passing arguments, and maybe just use the functor creation thing pretty directly. Concretely, here's how the code changes:

```diff
diff --git a/stan/math/gpu/kernel_cl.hpp b/stan/math/gpu/kernel_cl.hpp
index a285a11440..74a71db725 100644
--- a/stan/math/gpu/kernel_cl.hpp
+++ b/stan/math/gpu/kernel_cl.hpp
@@ -71,7 +71,7 @@ struct range_2d_kernel {
       : make_functor(name, source) {}
   auto operator()(int rows, int cols, Args... args) const {
     auto f = make_functor();
-    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
+    cl::EnqueueArgs eargs(cl::NDRange(rows, cols));
     f(eargs, args..., rows, cols).wait();
   }
 };
diff --git a/stan/math/gpu/matrix_gpu.hpp b/stan/math/gpu/matrix_gpu.hpp
index bf81a5d7c3..17513f9f4b 100644
--- a/stan/math/gpu/matrix_gpu.hpp
+++ b/stan/math/gpu/matrix_gpu.hpp
@@ -50,11 +50,10 @@ class matrix_gpu {
     if (A.size() == 0)
       return;
     // the context is needed to create the buffer object
-    cl::Context& ctx = opencl_context.context();
     try {
       // creates a read&write object for "size" double values
       // in the provided context
-      oclBuffer_ = cl::Buffer(ctx, CL_MEM_READ_WRITE, sizeof(double) * size());
+      oclBuffer_ = cl::Buffer(CL_MEM_READ_WRITE, sizeof(double) * size());
       opencl_kernels::copy(rows_, cols_, A.buffer(), this->buffer());
     } catch (const cl::Error& e) {
@@ -74,10 +73,9 @@ class matrix_gpu {
   */
  matrix_gpu(int rows, int cols) : rows_(rows), cols_(cols) {
    if (size() > 0) {
-     cl::Context& ctx = opencl_context.context();
      try {
        // creates the OpenCL buffer of the provided size
-       oclBuffer_ = cl::Buffer(ctx, CL_MEM_READ_WRITE,
+       oclBuffer_ = cl::Buffer(CL_MEM_READ_WRITE,
                                sizeof(double) * rows_ * cols_);
      } catch (const cl::Error& e) {
        check_opencl_error("matrix constructor", e);
@@ -99,13 +97,12 @@ class matrix_gpu {
   explicit matrix_gpu(const Eigen::Matrix<double, R, C>& A)
       : rows_(A.rows()), cols_(A.cols()) {
     if (size() > 0) {
-      cl::Context& ctx = opencl_context.context();
       cl::CommandQueue& queue = opencl_context.queue();
       try {
         // creates the OpenCL buffer to copy the Eigen
         // matrix to the OpenCL device
         oclBuffer_
-            = cl::Buffer(ctx, CL_MEM_READ_WRITE, sizeof(double) * A.size());
+            = cl::Buffer(CL_MEM_READ_WRITE, sizeof(double) * A.size());
        /**
         * Writes the contents of A to the OpenCL buffer
         * starting at the offset 0.
```

@rok-cesnovar Do you have a sense of why we'd use a non-default context? I'm mostly just noting that we could simplify further (and slightly improve performance due to the lack of singleton access) if we didn't need to use our own context. But I have no experience here with contexts and queues or why you'd need non-default ones, so I'm very happy to defer to you.
I'm looking at the code here that builds the default context: https://github.com/stan-dev/math/blob/develop/lib/opencl_1.2.8/CL/cl.hpp#L2672. The only difference I see is that we let the user select which device they want, where the default grabs all of them: https://github.com/stan-dev/math/blob/develop/stan/math/gpu/opencl_context.hpp#L102. I don't like the fact that I had to go look into the code to figure out what is in the default context. I get that the singleton is not as performant and that we get some niceties from using the default. But there are a few things about it that still bother me.
Yeah, the problem with the default context is that we have no control over which device is selected for the context. So we could end up running on the CPU or the integrated GPU, as opposed to an NVIDIA/AMD GPU.
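For reference, a rough sketch of how a device-pinned context gets built; this mirrors the general OpenCL C++ API pattern, not the exact opencl_context implementation:

```cpp
#include <CL/cl.hpp>
#include <vector>

// enumerate platforms, then ask a platform for its GPU devices only,
// so we never silently fall back to a CPU device
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

// build the context and queue around the one device we chose;
// the default context instead grabs every device it can find
cl::Context ctx(devices[0]);
cl::CommandQueue queue(ctx, devices[0]);
```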
Seems reasonable. Sounds like we want NDRange and to continue using the custom context and queue. Anyone want to take a crack at implementing this? Or should I?
If you have time today that would be super rad. I need to start making slides for StanCon but should have some time tonight.
I can also take a crack at it tomorrow morning and you can double-check me if I did something stupid.
^nice! Yeah, let's work on it over here.
@seantalts which test were you running to check this? I'm trying to run the transpose test and the kernel_cl test and getting errors.
Also, is there a way to unsquash commits? I was going to go backwards and see what the diff between the last commit and this one was, to figure out why it's failing, but it looks like you squashed them all here.
Oh, I see there is already a NOLINT comment on the semicolons... hmm.
Looks great! Do you guys think we should add the kernel doc strings to our generated docs?
Review comment on stan/math/gpu/kernel_cl.hpp (outdated):

```cpp
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), thread_block_size);
    f(eargs, args...).wait();
  };
```
The above semicolon is problematic for lint even with // NOLINT
Oh, haha. We should just take that out.
I thought I tried that and it gave me errors. I tried it again and it seems fine, as it should. Idk what it was doing before tbh :D
I wasn't clear on where exactly the problem with the semicolon arises, so I made a review comment. Agree with taking it out. You mean something like the sketch below? It could be a good idea, as it would help the developer to have all the kernel info in one place (no need to open the .cl files).
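Perhaps something in this direction (a sketch; the kernel, parameter names, and wording are all made up):

```cpp
/**
 * Copies one matrix buffer to another on the OpenCL device.
 *
 * @param[in] A source buffer
 * @param[out] B destination buffer
 * @param rows number of rows in A
 * @param cols number of columns in A
 */
const char* copy_kernel_code = STRINGIFY(
    // ... kernel body lives in this stringified block ...
);
```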
Also concur!
I thought at first the stringify was blocking the doxygen from being made, but I actually think we just need to add the kernel files to the doxygen config here: https://github.com/stan-dev/math/blob/develop/doxygen/doxygen.cfg#L774. I'm not against duplicating the docs here tho'.
Could we write a short doc that links to the kernels' doc instead of a full copy/paste?
Yeah, if there is doxygen doc being generated for the kernels and we can just link to it, that'd be awesome.
I have to do some work stuff right now, but tonight / later I can do that.
Sweet. I can also take a go at it if you are busy, especially if you know the linking format offhand.
From here it looks like it's just a plain doxygen cross-reference; see the sketch below.
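i.e., presumably something along these lines, using doxygen's `\ref` command (the identifier is assumed):

```cpp
/**
 * Enqueues the add kernel; see \ref add_kernel_code for the kernel source.
 */
```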
One prob tho' with the kernel docs rn: idt they will sit in the namespace we want.
Playing around with the doxygen \cond trick:

```cpp
// \cond
STRINGIFY(
// \endcond
...code
// \cond
)
// \endcond
```

so the `STRINGIFY(` and `)` are hidden from the generated docs.
I think there is probably a way to use doxygen's preprocessor so we don't need to put \cond markers around everything; a possible config sketch is below.
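For instance, doxygen's preprocessor can be told to expand STRINGIFY away entirely; a possible doxygen.cfg sketch, untested against this repo's actual config:

```
# let doxygen's preprocessor run and expand only the macros we list
ENABLE_PREPROCESSING = YES
MACRO_EXPANSION      = YES
EXPAND_ONLY_PREDEF   = YES
# make STRINGIFY(x) a no-op so the kernel body is visible to doxygen
PREDEFINED           = "STRINGIFY(x)=x"
```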
What about some form of grouping for the kernels? https://www.stack.nl/~dimitri/doxygen/manual/grouping.html A grouping sketch is below.
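A sketch along those lines, using doxygen's grouping commands (the group name and member are assumed):

```cpp
/**
 * @defgroup opencl_kernels OpenCL kernels
 * Kernel source strings and the functors that compile and enqueue them.
 */

/** @ingroup opencl_kernels */
const char* add_kernel_code = STRINGIFY(/* ... */);
```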
Commit: …'s within each kernel file. Doxygen works, but the object holding the kernel code is undocumented. All of the kernel structs are moved into the respective kernel files.
So I got this working for the docs and made a couple more changes. So now the kernels show up in the generated docs.

The main cost here is that I had to do a kind of gross-looking thing with the kernel code strings:

```cpp
// \cond
const char *add_kernel_code = STRINGIFY(
// \endcond
... code
// \cond
);
// \endcond
```

which is kind of gross, but idt it is that bad. The main bad thing about this is that the \cond wrappers have to appear around every kernel string. A 'not as bad but still bad' thing is that I'm no good at doxygen, and so the doc for each kernel struct just links to the file (though only the kernel and the struct are in each file, so idt that's too too gross). This is passing on my local. How do you both feel about this?
Nice! I don't think we can avoid this.
By user I meant a developer who would want to add new GPU kernels.
Very creative! Looks much better now. I'm really excited about the end result here; I think it's pretty elegant. The only thing I'd say is that now there is a global STRINGIFY floating around. Your doxygen wizardry is very cool. I am okay with the cond thing; it doesn't have to be perfect, and the generated docs are nice.
Ty! Is everyone cool with this? Two thumbs up, then let's merge this to the original branch.
Can we move the STRINGIFYs at some point?
Into the opencl_kernels header? Yeah, sure, that's reasonable. I'm at the gym atm but will change that and then merge this to the original branch.
I'll take care of it. Master? Or...?
Thanks! Sorry, typo, I meant bstatcomp:kernel_cl.
I lied, I can't merge it into bstatcomp:kernel_cl.
Done!