[WIP] [WIP] [WIP] a way of kerneling #7

seantalts · 2018-08-18T22:05:09Z

No description provided.

SteveBronder · 2018-08-19T00:09:39Z

I started messing with this and like how it looks to users. And the more I look at it the more it makes more sense. I tried putting variadic templating around the operator and make_functor and got it working. But I couldn't figure out the templating for the row and column range that we put into cl::NDRange()

  template<typename... Args>
  auto make_functor() const {
    return cl::make_kernel<Args...>(
        kernel_cl::compile(name_, source_));
  };
  // I couldn't figure out a better place to put row_range and col_range
  template<typename... Args>
  void operator()(int row_range, int col_range, Args&&... args) const {
    auto f = make_functor<Args...>();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(row_range, col_range)),
      args...).wait();
  }

seantalts · 2018-08-19T07:24:53Z

Sweet. I also played around with different variadic template argument ideas, but it's a delicate balance between 1) ease for kernel developers 2) ease for normal math library users 3) configuration for power users or whoever will come in and set their own EnqueueArgs that Rok was talking about.

In some sense what you've got above is an abstraction at a slightly higher level than unary_kernel, (but probably not the highest level one; there will be functions where the EnqueueArgs are more complicated) - we would still want a unary_kernel that inherits from that and puts in src and dest as the two matrix arguments. That way definitions for copy and transpose become very simple and provide good IDE support with the named arguments. I couldn't figure out any use case for a higher level abstraction layer like that - from what I can tell, there is no code that needs to consume generic GPU kernels without knowing the signatures involved...? Or at least nothing that couldn't be built roughly as easily with templates or the functors. I'm not sure though, can you think of anything like that? The only thing I can think of is some code that abstracts commonalities between unary and binary (and we'll probably have more than that, actually) kernels... but given that it's a lot of complication and inheritance+templating to abstract away like on net ~2-3 lines, I'm not sure if it'd be worth it... Maybe it will become more clear as the rest of the kernel accessors are translated to this, if we want to move forward with that?

seantalts · 2018-08-19T08:26:50Z

I played around with it more and started adding another kernel and I see why it's attractive to try to abstract more. Here are three options from most to least shared code:

template <typename ...FnArgs>
class basic_kernel {
 private:
  const char* name_;
  const char* source_;

 public:
  basic_kernel(const char* name, const char* source): name_(name), source_(source) {};
  auto make_functor() const {
    return cl::make_kernel<FnArgs...>(
        kernel_cl::compile(name_, source_));
  };
  template <typename ...OpArgs>
  void operator()(int rows, int cols, OpArgs... args) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      args...).wait();
  }
};

class unary_kernel : public basic_kernel<cl::Buffer, cl::Buffer, int, int> {
 public:
  unary_kernel(const char* name, const char* source): basic_kernel(name, source) {};
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    basic_kernel::operator()<cl::Buffer, cl::Buffer, int, int>(
        rows, cols, src, dest, rows, cols);
  }
};


class self_modifying_kernel : public basic_kernel<cl::Buffer, int, int> {
  self_modifying_kernel(const char* name, const char* source) :
      basic_kernel(name, source) {};
  void operator()(cl::Buffer src, int rows, int cols) const {
    basic_kernel::operator()<cl::Buffer, int, int>(
        rows, cols, src, rows, cols);
  }
};

class unary_kernel2 : public basic_kernel<cl::Buffer, cl::Buffer, int, int> {
  unary_kernel2(const char* name, const char* source): basic_kernel(name, source) {};
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, dest, rows, cols).wait();
  }
};

class self_modifying_kernel2 : public basic_kernel<cl::Buffer, int, int> {
  self_modifying_kernel2(const char* name, const char* source)
      : basic_kernel(name, source) {};
  void operator()(cl::Buffer src, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, rows, cols).wait();
  }
};

class unary_kernel3 {
 private:
  const char* name_;
  const char* source_;

 public:
  unary_kernel3(const char* name, const char* source): name_(name), source_(source) {};
  auto make_functor() const {
    return cl::make_kernel<cl::Buffer, cl::Buffer, int, int>(
        kernel_cl::compile(name_, source_));
  };
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      src, dest, rows, cols).wait();
  }
};

class self_modifying_kernel3 {
 private:
  const char* name_;
  const char* source_;

 public:
  self_modifying_kernel3(const char* name, const char* source)
      : name_(name), source_(source) {};
  auto make_functor() const {
    return cl::make_kernel<cl::Buffer, int, int>(
        kernel_cl::compile(name_, source_));
  };
  void operator()(cl::Buffer src, int rows, int cols) const {
    auto f = make_functor();
    f(cl::EnqueueArgs(opencl_context.queue(), cl::NDRange(rows, cols)),
      src, rows, cols).wait();
  }
};

I think 1 is very confusing, I'm leaning towards 2 or 3 (not trying to abstract operator() at least). I know 3 has all that redundancy but honestly redundancy like that seems ugly but inheritance is really expensive mentally and often just not worth it. We're close to that border here with just 2 fields and a one line function saved. What do you think?

seantalts · 2018-08-19T08:47:29Z

Oh man, I was trying to figure out how to explain all the ~*complicated emotions*~ I have about inheritance and I remembered that a common catechism is "Always prefer composition to inheritance" and then I was like, hm, how can we use composition here, and there is this which I quite like:

template <typename ...Args>
class kernel_functor {
 private:
  const char* name_;
  const char* source_;

 public:
  kernel_functor(const char* name, const char* source)
      : name_(name), source_(source) {};

  auto operator()() const {
    return cl::make_kernel<Args...>(kernel_cl::compile(name_, source_));
  };
};

struct unary_kernel4 {
  kernel_functor<cl::Buffer, cl::Buffer, int, int> make_functor;
  unary_kernel4(const char* name, const char* source) : make_functor(name, source) {};
  void operator()(cl::Buffer src, cl::Buffer dest, int rows, int cols) const {
    auto f = make_functor();
    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
    f(eargs, src, dest, rows, cols).wait();
  }
};

Now make_functor is a field on a kernel (unary or otherwise) that is populated by the constructor. Since the constructors always need to be defined for subclasses anyway, this introduces pretty much no extra code over the inheritance option and it keeps the responsibilities separate and clear and modulated by the interface to the functor.

The ugliness in trying to abstract operator() happens because of the need to extract out the number of rows and columns somehow. I don't think there's a good way around it but I'm pretty happy to repeat those 3 lines of code for each kernel type.

seantalts · 2018-08-19T11:04:21Z

I think if we wanted to do a more abstract thing (maybe we do?) we'd want to rewrite all of the applicable kernels so that the parameters we need to make the NDRange (most often or maybe always the rows and cols) come first (last wouldn't work with parameter packs). Then it'd be pretty easy... Maybe we should do that?

rok-cesnovar · 2018-08-19T13:17:51Z

These really looks nice! Thanks Sean for the efforts. Really apprecitated!!
I was away yesterday afternoon so just getting through this now and I like it a lot.

If rewriting would make things easier I am all for it.

We will mostly have the rows&cols NDRange parameters, we do sometimes have 1D or 3D ranges however.

What if the first parameter was simply cl::NDRange? And we would could also have 2 cl::NDRange parameters if we had kernels where we want to specify the thread block size (work-group size)?

SteveBronder · 2018-08-20T03:59:14Z

So far I like it! Agree with rock on having the first arg be NDRange

seantalts · 2018-08-20T08:44:00Z

Hm, that could work. I want to write out a little bit of stuff I learned by playing with it all and then have a question about the NDRange arg.

The first parameter to the KernelFunctorThing that OpenCL gives us is an EnqueueArgs which has lots of constructors to allow for setting many different options easily. This is a nice structure intended to basically do what we want to do here. The only reason in my head that we don't use that directly is because we actually aren't using the default context or default command queue, and so we always have to explicitly pass the one in from opencl_context.queue(). If we WERE setting and using the default context, I think we could pretty much just use the functor directly and allow people to construct an EnqueueArgs with the appropriate NDRanges. But then we've coded ourself into a corner where suddenly having different contexts or command queues on something between a global and use-site basis is a little bit more of a pain in the ass. This is maybe fine since it seems unlikely we'd have groups of kernels using one context and some groups using another...?
It'd be nice to somehow encode that we pretty much always want to use add et al with this default NDRange(rows, cols) (Well first of all, is that a correct statement?). So having a small chunk of code that encapsulates that we're using the opencl_context.queue() and the default ranges seems good if we can manage it, though I guess all of these are going to be wrapped up by something that actually takes in a matrix_gpu anyway, right?

rok-cesnovar · 2018-08-20T09:41:17Z

This is a really really unlikely case that we would be handling multiple contexts with different groups.
The add kernel will always be called with (rows,cols) threads, The same goes for all the current basic matrix kernels that are done in this manner (1 thread -> 1 matrix element).

seantalts · 2018-08-20T21:51:18Z

I see that we're creating a non-default queue (and context?) - maybe we can find a way to set those to default and then not have to access them from the opencl_context singleton and pass them anymore...

re: NDRange first arg - it is bothering slightly in the kind of code golf perfectionist way that someone would call add like this:

opencl_kernel::add(cl::NDRange(rows, cols), A, B, C, rows, cols);

instead of

opencl_kernel::add(rows, cols, A, B, C);

seantalts · 2018-08-20T22:36:40Z

I put up partial refactoring towards the 2nd version. It's not so bad. What do you guys think of these two options? Should we just go with something? Sorry for spending so much time messing around with this interface.

SteveBronder · 2018-08-21T03:27:24Z

I went through the OpenCL C++ API docks and I see a mention of the default context, but I don't see a lot of info about it. Can you show me where you are seeing that?

I'm personally more in favor of the first option with cl::NDRange(). @rok-cesnovar are there times we would not want add to have a range based on the rows and columns for the first matrix? If that does not exist then I'm cool with the 2nd version

SteveBronder · 2018-08-21T04:14:09Z

Also totally no worries, I really appreciate the effort! It's awesome to have your input on the design

seantalts · 2018-08-21T07:58:31Z

@rok-cesnovar are there times we would not want add to have a range based on the rows and columns for the first matrix?

Hm, I'm confused about add now - the doc says

 * @param rows Number of rows for matrix A.
 * @param cols Number of rows for matrix B.

isn't this supposed to just add two matrices of the same size together? if not, wouldn't you need more information to be able to select which parts of the larger matrix to use?
why does it say rows are from A and cols are from B? haha

seantalts · 2018-08-21T08:00:04Z

PS if you guys prefer the NDRange version I'm happy with that too. I'm a bit on the fence and was just going with the 2nd because it's slightly less typing, but maybe it's also more confusing?

rok-cesnovar · 2018-08-21T08:14:23Z

These is an error in the doc :(, I apologize for that. And thanks for finding it. I think this is some copy paste that was never fixed :/

I would maybe also prefer the 1st option as we would be able to use it for kernel like the multiply_diagonal_with_scalar (which is coming up) that spawns A.rows() threads where each thread multiplies one diagonal element. We could simply use ( NDRange( A.rows() ), A, rows, cols) ).

Also for the GPU kernels like the cov_exp_quad one that spawn x.size() times x.size() threads, we would also be able to reuse it with NDRange( x.size(), x.size() ), x, x.size()

seantalts · 2018-08-21T08:44:23Z

Sounds good!

On Tue, Aug 21, 2018 at 10:14 Rok Češnovar ***@***.***> wrote: These is an error in the doc :(, I apologize for that. And thanks for finding it. I think this is some copy paste that was never fixed :/ I would maybe also prefer the 1st option as we would be able to use it for kernel like the multiply_diagonal_with_scalar (which is coming up) that spawns A.rows() threads where each thread multiplies one diagonal element. We could simply use ( NDRange( A.rows() ), A, rows, cols) ). Also for the GPU kernels like the cov_exp_quad one that spawn x.size() times x.size() threads, we would also be able to reuse it with NDRange( x.size(), x.size() ), x, x.size() — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#7 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAxJ7FhEfzWC_d5wUzqHpgHNkPsUy0G-ks5uS8FggaJpZM4WCvZC> .

-- (phoning it in ;))

seantalts · 2018-08-21T13:18:30Z

I'm not sure about whether or not we'd want to use the default context... but I saw here that we could use the default context and queue in a bunch of places and just avoid passing arguments and just use the functor creation thing maybe pretty directly. Concretely, here's how the code changes:

diff --git a/stan/math/gpu/kernel_cl.hpp b/stan/math/gpu/kernel_cl.hpp
index a285a11440..74a71db725 100644
--- a/stan/math/gpu/kernel_cl.hpp
+++ b/stan/math/gpu/kernel_cl.hpp
@@ -71,7 +71,7 @@ struct range_2d_kernel {
       : make_functor(name, source) {}
   auto operator()(int rows, int cols, Args... args) const {
     auto f = make_functor();
-    cl::EnqueueArgs eargs(opencl_context.queue(), cl::NDRange(rows, cols));
+    cl::EnqueueArgs eargs(cl::NDRange(rows, cols));
     f(eargs, args..., rows, cols).wait();
   }
 };

diff --git a/stan/math/gpu/matrix_gpu.hpp b/stan/math/gpu/matrix_gpu.hpp
index bf81a5d7c3..17513f9f4b 100644
--- a/stan/math/gpu/matrix_gpu.hpp
+++ b/stan/math/gpu/matrix_gpu.hpp
@@ -50,11 +50,10 @@ class matrix_gpu {
     if (A.size() == 0)
       return;
     // the context is needed to create the buffer object
-    cl::Context& ctx = opencl_context.context();
     try {
       // creates a read&write object for "size" double values
       // in the provided context
-      oclBuffer_ = cl::Buffer(ctx, CL_MEM_READ_WRITE, sizeof(double) * size());
+      oclBuffer_ = cl::Buffer(CL_MEM_READ_WRITE, sizeof(double) * size());
 
       opencl_kernels::copy(rows_, cols_, A.buffer(), this->buffer());
     } catch (const cl::Error& e) {
@@ -74,10 +73,9 @@ class matrix_gpu {
    */
   matrix_gpu(int rows, int cols) : rows_(rows), cols_(cols) {
     if (size() > 0) {
-      cl::Context& ctx = opencl_context.context();
       try {
         // creates the OpenCL buffer of the provided size
-        oclBuffer_ = cl::Buffer(ctx, CL_MEM_READ_WRITE,
+        oclBuffer_ = cl::Buffer(CL_MEM_READ_WRITE,
                                 sizeof(double) * rows_ * cols_);
       } catch (const cl::Error& e) {
         check_opencl_error("matrix constructor", e);
@@ -99,13 +97,12 @@ class matrix_gpu {
   explicit matrix_gpu(const Eigen::Matrix<double, R, C>& A)
       : rows_(A.rows()), cols_(A.cols()) {
     if (size() > 0) {
-      cl::Context& ctx = opencl_context.context();
       cl::CommandQueue& queue = opencl_context.queue();
       cl::CommandQueue& queue = opencl_context.queue();
       try {
         // creates the OpenCL buffer to copy the Eigen
         // matrix to the OpenCL device
         oclBuffer_
-            = cl::Buffer(ctx, CL_MEM_READ_WRITE, sizeof(double) * A.size());
+            = cl::Buffer(CL_MEM_READ_WRITE, sizeof(double) * A.size());
         /**
          * Writes the contents of A to the OpenCL buffer
          * starting at the offset 0.

@rok-cesnovar Do you have a sense of why we'd use a non-default context? I'm mostly just noting that we could simplify further (and slightly improve performance due to the lack of singleton access) if we didn't need to use our own context. But I have no experience here with contexts and queues or why you'd need non-default ones. So very happy to defer to you.

SteveBronder · 2018-08-21T14:05:53Z

I'm looking at the code here that builds the default context

https://github.com/stan-dev/math/blob/develop/lib/opencl_1.2.8/CL/cl.hpp#L2672

The only difference I see is that we let the user select which device they want where the default grabs all of them.

https://github.com/stan-dev/math/blob/develop/stan/math/gpu/opencl_context.hpp#L102

I don't like the fact that I had to go look into the code to figure out what is in the default context. I get the singleton is not as performant and we get some niceties from using the default. But:

The default context is not well documented
Not being explicit in where we are sending things could be confusing to people in the future
I'm unsure whether we would want specific things inside of our context and queue in the future. Because of that I'd rather have our own rather than tracing back to it in the future.

rok-cesnovar · 2018-08-21T15:32:17Z

Yeah, the problem with the default context is that we have no control which device is selected for the context.

So we could end up running on the CPU or the integrated GPU as opposed to a NVIDIA/AMD GPU.

seantalts · 2018-08-21T18:30:17Z

Seems reasonable. Sounds like we want NDRange and to continue using the custom context and queue. Anyone want to take a crack at implementing this? or should I?

SteveBronder · 2018-08-21T18:48:32Z

If you have time today that would be super rad. I need to start making slides for stancon but should have some time tonight

rok-cesnovar · 2018-08-21T18:53:05Z

I can also take a crack at it tomorrow morning and you can double check me if I did something stupid.
Should we do it on this branch and then move things to the PR branch?

SteveBronder · 2018-08-21T18:53:31Z

^nice! Yeah let's work on it over here

SteveBronder · 2018-08-22T02:13:57Z

@seantalts which test were you running to check this? I'm trying to run the transpose test and kernel_cl test and getting CL_INVALID_KERNEL_NAME

SteveBronder · 2018-08-22T02:18:11Z

Also is there a way to unsquash commits? I was going to go backwards and see what the diff between the last commit and this one was to figure out why it's failing but it looks like you squashed them all here

seantalts · 2018-08-22T11:23:14Z

Oh I see there is already a NOLINT comment on the semicolons... hmm

seantalts · 2018-08-22T11:26:56Z

Looks great! Do you guys think we should add the kernel doc strings to our global_range_kernel instances somehow? I could go either way.

rok-cesnovar · 2018-08-22T11:29:25Z

stan/math/gpu/kernel_cl.hpp

+    auto f = make_functor();
+    cl::EnqueueArgs eargs(opencl_context.queue(), thread_block_size);
+    f(eargs, args...).wait();
+  };


The above semicolon is problematic for lint even with // NOLINT

Oh, haha. We should just take that out.

I thought I tried and that gave me errors. I tried it again and it seems fine as it should. Idk what it was doing before tbh :D

rok-cesnovar · 2018-08-22T11:36:08Z

I wasnt clear where exactly the problem with the semicolon arises. I made a review comment.

Agree with the global_range_kernel name!

You mean something like

/**
 * Makes an identity matrix on the GPU
 *
 * @param range the thread organization to execute the kernel with
 * @param[in,out] A The identity matrix output.
 * @param rows The number of rows for A.
 * @param cols The number of cols for A.
 *
 * @note This kernel uses the helper macros available in helpers.cl.
 */
const range_2d_kernel<cl::Buffer, int, int> identity("identity",
#include <stan/math/gpu/kernels/identity_matrix.cl>  // NOLINT
);  // NOLINT

I could be a good idea as it would help the developer to have all the kernel info in one place (no need to open the .cl files).

SteveBronder · 2018-08-22T14:30:54Z

Agree with the global_range_kernel name!

Also concur!

I could be a good idea as it would help the developer to have all the kernel info in one place (no need to open the .cl files).

I thought at first the stringify was blocking the doxygen from being made, but I actually think we just need to add .cl to this line in the doxygen.cfg

https://github.com/stan-dev/math/blob/develop/doxygen/doxygen.cfg#L774

I'm not against duplicating the docs here tho'

SteveBronder · 2018-08-22T14:36:56Z

Could we write a short doc that links to the kernels doc instead of a full copy/paste?

seantalts · 2018-08-22T14:55:32Z

Yeah, if there is doxygen doc being generated for the kernels and we can just link to it that'd be awesome.

SteveBronder · 2018-08-22T15:52:52Z

I have to do some work stuff right now but tonight / later I can do that

seantalts · 2018-08-22T16:33:38Z

Sweet. I can also take a go at it if you are busy, especially if you know the linking format offhand.

SteveBronder · 2018-08-22T18:01:21Z

From here it looks like it's just

<functionName>"("<argument-list>")"

One prob tho' with the kernel docs rn, idt they will sit in the namespace we want (stan::math) since they are only imported as strings and not specified in those namespaces

https://www.stack.nl/~dimitri/doxygen/manual/autolink.html

SteveBronder · 2018-08-22T18:40:17Z

Playing around with the .cl files I got the docs to compile by doing

// \cond
STRINGIFY(
// \endcond
...code
// \cond
)
// \endcond

so the /cond and \endcond tells doxygen to ignore the stringify and document the code. Then we can include a link to the kernel code with something like

/**
 * See the docs for \link identity_matrix.identity \endlink
 */

I think there is probably a way to use doxygen's preprocessor so we don't need to put \cond everywhere.

seantalts · 2018-08-22T23:11:11Z

What about some form of grouping for the kernels? https://www.stack.nl/~dimitri/doxygen/manual/grouping.html

…'s within each kernel file. Doxygen works but the object holding the kernel code is undocumented. All of the kernel structs are moved into the respective kernel file

SteveBronder · 2018-08-23T01:44:54Z

So I got this working for the docs and made a couple more changes

I changed the kernel files so that the kernel's string literal reads into a const char* name_kernel_code. I changed the file names to end in .hpp so the kernel files are regular header files now. That got rid of the #include <blah/blah> // NOLINT thing we were passing to the kernel struct. I started doing this because the #include made the doxygen look very gross but I think it makes sense to do this anyway.
Because those new .hpp files were just making that char we have to import all of those kernels into kernel_cl at the top which looked kind of bad. So I did the reverse and imported kernel_cl into each kernel file and defined the struct there.

So now #include <stan/math/gpu/kernels/copy.hpp> gives access to stan::math::opencl_kernels::copy()

The main cost here is that I had to do a kind of gross looking thing with \cond i.e.

// \cond
const char *add_kernel_code = STRINGIFY(
    // \endcond
... code
// \cond
);
// \endcond

Which is kind of gross but idt it is that bad. The main bad thing about this is that the add_kernel_code will not have docs. But we can include in the doc for each kernel that the code actually sits in name_kernel_code

A 'not as bad but still bad' is that I'm no-good at doxygen and so the doc for each kernel struct just links to the file (though only the kernel and the struct are in each file so idt that's too too gross).

This is passing on my local. How do you both feel about this?

rok-cesnovar · 2018-08-23T09:01:07Z

Nice!
Getting red of the #include stuff is a big plus.
One benefit of your new design is also that is simplifies the process/instructions because the user does not need to change the kernel_cl.hpp file.

I don't think we can avoid this \cond and \endcond anyways if we want the kernel docs.

rok-cesnovar · 2018-08-23T09:01:49Z

By user I meant a developer that would want to add new GPU kernels.

seantalts · 2018-08-23T09:16:34Z

Very creative! Looks much better now. I'm really excited about the end result here, I think it's pretty elegant.

The only thing I'd say is that now there is a global STRINGIFYmacro in the codebase. I think this is probably fine. I think you should be able to have a single definition in kernel_cl.hpp.

Your doxygen wizardry is very cool. I am okay with the cond thing - it doesn't have to be perfect and the generated docs are nice.

SteveBronder · 2018-08-23T09:40:52Z

Ty!

Is everyone cool with this? Two thumbs up then lets merge this to the original branch

seantalts · 2018-08-23T09:53:13Z

Can we move the STRINGIFYs at some point?

SteveBronder · 2018-08-23T11:49:12Z

Into the opencl_kernels header? Yeah sure that's reasonable. I'm at the gym atm but will change that then merge this to the original branch**

seantalts · 2018-08-23T11:58:30Z

I'll take care of it. Master? or bstatcomp:kernel_cl?

SteveBronder · 2018-08-23T12:01:21Z

Thanks! Srry typo meant bstatcomp:kernel_cl

seantalts · 2018-08-23T12:16:44Z

I lied, I can't merge it into bstatcomp:kernel_cl - one of you want to do it? Then I can approve and merge the PR on stan-dev :D

SteveBronder · 2018-08-23T12:19:05Z

Done!

seantalts mentioned this pull request Aug 19, 2018

Change how users access OpenCL kernels stan-dev/math#966

Merged

6 tasks

seantalts force-pushed the kcl branch from f0f8c23 to bc510eb Compare August 19, 2018 11:53

snapshot of a work in progress design experiment

0b4f512

seantalts force-pushed the kcl branch from bc510eb to 0b4f512 Compare August 20, 2018 22:34

rok-cesnovar reviewed Aug 22, 2018

View reviewed changes

rok-cesnovar added 2 commits August 22, 2018 14:02

changed to global_range, removed semicolons :)

6f8bd4a

...

fba61eb

SteveBronder added 2 commits August 22, 2018 21:24

Changes all the kernel files so that they are placed into const char*…

aadb3ce

…'s within each kernel file. Doxygen works but the object holding the kernel code is undocumented. All of the kernel structs are moved into the respective kernel file

fix lint issues

4189530

Move STRINGIFY to a single location in kernel_cl.hpp

6a2014c

SteveBronder merged commit 3625c8f into bstatcomp:kernel_cl Aug 23, 2018

rok-cesnovar deleted the kcl branch December 1, 2019 13:56

[WIP] [WIP] [WIP] a way of kerneling #7

[WIP] [WIP] [WIP] a way of kerneling #7

Conversation

seantalts commented Aug 18, 2018

SteveBronder commented Aug 19, 2018

seantalts commented Aug 19, 2018

seantalts commented Aug 19, 2018 • edited Loading

seantalts commented Aug 19, 2018 • edited Loading

seantalts commented Aug 19, 2018 • edited Loading

rok-cesnovar commented Aug 19, 2018 • edited Loading

SteveBronder commented Aug 20, 2018

seantalts commented Aug 20, 2018

rok-cesnovar commented Aug 20, 2018

seantalts commented Aug 20, 2018

seantalts commented Aug 20, 2018

SteveBronder commented Aug 21, 2018

SteveBronder commented Aug 21, 2018

seantalts commented Aug 21, 2018

seantalts commented Aug 21, 2018

rok-cesnovar commented Aug 21, 2018

seantalts commented Aug 21, 2018 via email

seantalts commented Aug 21, 2018

SteveBronder commented Aug 21, 2018

rok-cesnovar commented Aug 21, 2018

seantalts commented Aug 21, 2018

SteveBronder commented Aug 21, 2018

rok-cesnovar commented Aug 21, 2018

SteveBronder commented Aug 21, 2018

SteveBronder commented Aug 22, 2018

SteveBronder commented Aug 22, 2018

seantalts commented Aug 22, 2018

seantalts commented Aug 22, 2018

rok-cesnovar Aug 22, 2018

Choose a reason for hiding this comment

seantalts Aug 22, 2018

Choose a reason for hiding this comment

rok-cesnovar Aug 22, 2018

Choose a reason for hiding this comment

rok-cesnovar commented Aug 22, 2018

SteveBronder commented Aug 22, 2018

SteveBronder commented Aug 22, 2018

seantalts commented Aug 22, 2018

SteveBronder commented Aug 22, 2018

seantalts commented Aug 22, 2018

SteveBronder commented Aug 22, 2018

SteveBronder commented Aug 22, 2018 • edited Loading

seantalts commented Aug 22, 2018

SteveBronder commented Aug 23, 2018

rok-cesnovar commented Aug 23, 2018

rok-cesnovar commented Aug 23, 2018

seantalts commented Aug 23, 2018

SteveBronder commented Aug 23, 2018

seantalts commented Aug 23, 2018

SteveBronder commented Aug 23, 2018 • edited Loading

seantalts commented Aug 23, 2018

SteveBronder commented Aug 23, 2018

seantalts commented Aug 23, 2018

SteveBronder commented Aug 23, 2018

seantalts commented Aug 19, 2018 •

edited

Loading

seantalts commented Aug 19, 2018 •

edited

Loading

seantalts commented Aug 19, 2018 •

edited

Loading

rok-cesnovar commented Aug 19, 2018 •

edited

Loading

SteveBronder commented Aug 22, 2018 •

edited

Loading

SteveBronder commented Aug 23, 2018 •

edited

Loading