
Eigen member function for reading fvar values and derivatives #1745

Closed
andrjohns opened this issue Feb 25, 2020 · 33 comments

@andrjohns
Collaborator

andrjohns commented Feb 25, 2020

Description

As first raised in this issue, it would be useful to have an Eigen function for reading a matrix of fvar<T> into two matrices of type T. By implementing this through Eigen, the read can be done with a single traversal of the matrix and can possibly take advantage of Eigen's SIMD vectorisation facilities.

This would be used like the following:

  MatrixXd vals(100, 100);
  MatrixXd derivs(100, 100);

  fvar_mat.read_fvar(vals, derivs);

Current Version:

v3.1.0

@t4c1
Contributor

t4c1 commented Feb 25, 2020

What you did not specify here is that assigning to this expression should modify the underlying matrices. That would be the advantage over to_fvar.

@andrjohns
Collaborator Author

andrjohns commented Feb 25, 2020 via email

@t4c1
Contributor

t4c1 commented Feb 25, 2020

I meant the example I posted in the linked issue. I'm making it a bit more complete here:

Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
vals.make_fvar(derivs) = x;

EXPECT_EQ(x(0,0).val_, vals(0,0));
EXPECT_EQ(x(0,0).d_, derivs(0,0));
...

@andrjohns
Collaborator Author

Oh I misunderstood what you meant in the issue, you want to create two matrices of doubles from a single matrix of fvars, not the other way around?

@t4c1
Contributor

t4c1 commented Feb 25, 2020

Exactly. Though I also would not mind a generalized to_fvar that would work the other way around.

@andrjohns
Collaborator Author

That makes sense, I'll look into it

@andrjohns
Collaborator Author

The simplest way to add it (that I can see) would be a member function like the following:

template<typename OtherDerived>
void read_fvar(OtherDerived& other1, OtherDerived& other2) {
  for(size_t i = 0; i < derived().size(); ++i) {
    other1.derived().coeffRef(i) = derived().coeff(i).val_;
    other2.derived().coeffRef(i) = derived().coeff(i).d_;
  }
}

Which, borrowing your example above, would be used like:

Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
x.read_fvar(vals, derivs);

So it would just be a single loop on the backend copying the values and derivatives into separate matrices. How does that sound?

@bob-carpenter
Contributor

bob-carpenter commented Feb 25, 2020 via email

@t4c1
Contributor

t4c1 commented Feb 25, 2020

I was hoping for a way that would let Eigen evaluate the expression in a vectorized fashion. But that might be too hard to achieve.

Your suggestion has a few issues. First, not all expressions support linear indexing; for example, Eigen::Ref does not. Also, if you get a row-major expression and a column-major destination, linear indexing will address the elements in different orders, producing a wrong result.

So to be general we must index the matrices in two dimensions. We can either use two loops or leave the order up to Eigen with a somewhat hacky solution (I did not test this; I hope it works):

template <typename T_val, typename T_deriv>
void read_fvar(T_val& val, T_deriv& deriv) {
  val = T_val::NullaryExpr(val.rows(), val.cols(),
      [this, &deriv](Eigen::Index i, Eigen::Index j) {
        // store the derivative as a side effect and return the value
        deriv.coeffRef(i, j) = this->coeff(i, j).d_;
        return this->coeff(i, j).val_;
      });
}

I got the idea from here: https://eigen.tuxfamily.org/dox/TopicCustomizing_NullaryExpr.html

@bob-carpenter
Contributor

bob-carpenter commented Feb 25, 2020 via email

@andrjohns
Collaborator Author

@t4c1 Got a NullaryExpr version working, how does this look:

template <typename OtherDerived>
auto read_fvar(OtherDerived& other1, OtherDerived& other2) {
  return Derived::NullaryExpr(derived().rows(), derived().cols(),
                              [&](Index i, Index j){
                                other1.derived().coeffRef(i,j) = derived().coeff(i,j).val_;
                                other2.derived().coeffRef(i,j) = derived().coeff(i,j).d_;
                              });
}

@t4c1
Contributor

t4c1 commented Feb 26, 2020

I thought we needed to return some value from the lambda in NullaryExpr. Can you give me an example of how you call that? Anyway, if it works it is fine.

@andrjohns
Collaborator Author

Doesn't seem to be an issue, which is nice. As for usage, given a matrix of fvar<double>, which is fvar_mat in the example below:

  MatrixXd val_out(100, 100);
  MatrixXd d_out(100, 100);

  fvar_mat.read_fvar(val_out, d_out);

@t4c1
Contributor

t4c1 commented Feb 26, 2020

Interesting. I thought all Eigen expressions used lazy evaluation. Maybe they anticipated such a use case and made NullaryExpr evaluate eagerly if its functor returns void, even though that does not seem to be documented. Anyway, this makes the code easier to read than my suggestion.

@andrjohns changed the title from "Eigen member function for constructing fvar matrix" to "Eigen member function for reading fvar values and derivatives" on Feb 26, 2020
@andrjohns
Collaborator Author

Great, I'll write some tests and update some existing functions before I open a pull request with the added method.

@andrjohns
Collaborator Author

I've put in an initial version for matrices of var (fvar not yet tested), comparing looping against two Eigen methods:

EigMethod (Current approach):

vals = mat.val();
adj = mat.adj();

EigMethod_New (New method):

mat.read_var(vals, adj);

Overall it looks like the new method is on par with plain looping (which is unsurprising, since that's what it's doing on the backend):

2020-04-11 12:19:03
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.58, 0.57, 0.62
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
Matrix_EigMethod        3778165 ns      3778080 ns          172
Matrix_EigMethod_New    2610224 ns      2609719 ns          277
Matrix_LoopRMajor      19648612 ns     19652926 ns           34
Matrix_LoopCMajor       2562938 ns      2563043 ns          281

RowVec_EigMethod         160340 ns       160375 ns         4430
RowVec_EigMethod_New     128850 ns       128877 ns         5398
RowVec_Loop              130614 ns       130643 ns         5233

ColVec_EigMethod         263109 ns       263168 ns         2785
ColVec_EigMethod_New     163409 ns       163442 ns         3833
ColVec_Loop              169458 ns       169492 ns         3820
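For reference, the output above is in the standard Google Benchmark format; each case is roughly along the lines of the sketch below. The matrix size, setup, and use of .cast<stan::math::var>() are hypothetical, not the exact benchmark code:

#include <benchmark/benchmark.h>
#include <stan/math.hpp>

// Hypothetical sketch of the comparison: two passes via the existing
// .val()/.adj() plugin methods versus a single pass via the proposed
// read_var() plugin method.
static void Matrix_EigMethod(benchmark::State& state) {
  Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> mat
      = Eigen::MatrixXd::Random(1000, 1000).cast<stan::math::var>();
  Eigen::MatrixXd vals(1000, 1000), adj(1000, 1000);
  for (auto _ : state) {
    vals = mat.val();  // first pass over mat
    adj = mat.adj();   // second pass over mat
    benchmark::DoNotOptimize(vals.data());
    benchmark::DoNotOptimize(adj.data());
  }
}
BENCHMARK(Matrix_EigMethod);

static void Matrix_EigMethod_New(benchmark::State& state) {
  Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic> mat
      = Eigen::MatrixXd::Random(1000, 1000).cast<stan::math::var>();
  Eigen::MatrixXd vals(1000, 1000), adj(1000, 1000);
  for (auto _ : state) {
    mat.read_var(vals, adj);  // single pass over mat
    benchmark::DoNotOptimize(vals.data());
    benchmark::DoNotOptimize(adj.data());
  }
}
BENCHMARK(Matrix_EigMethod_New);

BENCHMARK_MAIN();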

The added Eigen method is below, which checks whether the input is row- or column-major and loops appropriately:

template <typename OtherDerived>
inline void read_var(OtherDerived& other1, OtherDerived& other2) {
  Eigen::Index R = derived().rows();
  Eigen::Index C = derived().cols();
  if (derived().IsRowMajor) {
    // row-major input: iterate rows in the outer loop so reads are in order
    for (Eigen::Index r = 0; r < R; ++r) {
      for (Eigen::Index c = 0; c < C; ++c) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  } else {
    // column-major input (Eigen's default): iterate columns in the outer loop
    for (Eigen::Index c = 0; c < C; ++c) {
      for (Eigen::Index r = 0; r < R; ++r) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  }
}

Overall, it's looking good!

@t4c1
Contributor

t4c1 commented Apr 11, 2020

That looks nice! I think Eigen's NullaryExpr might do some more optimizations (a single loop with linear indexing, where possible), but the benefit of that is likely small.

@bob-carpenter
Contributor

Cool idea. The SIMD speedups can be really big if they have the right unrolling strides.

Would it be possible to do the same kind of thing with methods for var? I don't think the same speedup would be possible because of the vari* chasing.

@t4c1
Contributor

t4c1 commented Apr 11, 2020

This is not actually using SIMD. The speedup comes from more efficient use of the cache: we only do a single pass over the matrix of var.

@bob-carpenter
Contributor

SSE and AVX on the chips can give you SIMD from unrolled loops that contain multiple copies of operations. Keeping the SSE and AVX pipelines full requires efficient use of cache and the right byte alignments. See the Wikipedia article on SIMD.

Code written specifically to allow SSE or AVX optimizations often looks like this, say, to copy an array:

// manually unrolled by 4 so the compiler can map each group of copies onto
// SIMD loads/stores (assumes N is a multiple of 4)
for (int i = 0; i < N; i += 4) {
  a[i] = b[i];
  a[i + 1] = b[i + 1];
  a[i + 2] = b[i + 2];
  a[i + 3] = b[i + 3];
}

It doesn't look like it's doing anything different from a plain element-by-element loop, but with SSE, AVX, etc., the compiler can turn the unrolled body into parallelized operations that happen 4 at a time (or even 8 at a time with later AVX, I think). In some cases I think it can even do this automatically from a simple loop.

See, for example, Alex Barnett's C++ benchmarks for n-body problems. There are examples using loops, OpenCL, and OpenCL with loop unrolling and Agner Fog's vector class, plus further optimizations using specialized inverse sqrt functions (one of which is implemented in Eigen).

@t4c1
Contributor

t4c1 commented Apr 11, 2020

Our case is not that simple - we are not copying consecutive memory locations. We are also accessing the values through different pointers, and I doubt the compiler can deduce that they point to consecutive locations. Even if it could, we have interleaved val_s and adj_s, which we are copying to different destinations.

@bob-carpenter
Contributor

Our case is not that simple - we are not copying consecutive memory locations.

Right. That's why I was wondering if it'd still be useful. In practice, the vari* pointers for a Matrix<var, R, C> will be consecutive, but that breaks as soon as we assign into the matrix.

@t4c1
Contributor

t4c1 commented Apr 11, 2020

Yeah. That is why this can never be as efficient as what we can do once we have those static matrices (#1805).

@andrjohns
Collaborator Author

@t4c1 Thanks! I tried the NullaryExpr approach, but couldn't get it to evaluate properly (it always returned 0). I've got an example up on godbolt here (a bit verbose since I wrote out the full functor rather than a lambda) where you can see the issue.

@bob-carpenter I ran the program through g++ with the flag -fopt-info-vec-all, and it looks like it isn't being vectorised:

math/stan/math/prim/eigen_plugins.h:210:22: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: not vectorized: data ref analysis failed: _356 = _348->val_;
math/stan/math/prim/eigen_plugins.h:211:24: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: versioning for alias not supported for: can't determine dependence between _348->val_ and *_355

I think this indicates potential pointer aliasing (where multiple pointers could point to the same location in memory), which inhibits auto-vectorisation (more info here).
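As a minimal standalone illustration of that kind of aliasing constraint (hypothetical code, nothing Stan-specific): without a guarantee that the source and destination don't overlap, the compiler has to assume a store through one pointer may change what the other reads, so it falls back to scalar code; a non-aliasing promise such as GCC/Clang's __restrict__ removes that dependence.

// The compiler cannot prove dst and src don't alias, so it must keep loads
// and stores in order and may decline to vectorise this loop.
void copy_may_alias(double* dst, const double* src, int n) {
  for (int i = 0; i < n; ++i) dst[i] = src[i];
}

// With __restrict__ (a GCC/Clang extension) we promise there is no overlap,
// which removes the dependence the vectoriser complained about.
void copy_no_alias(double* __restrict__ dst, const double* __restrict__ src,
                   int n) {
  for (int i = 0; i < n; ++i) dst[i] = src[i];
}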

@bob-carpenter
Contributor

That option -fopt-info-vec-all is super cool! Thanks for sharing.

I'd guess it means the pointers could overlap, not that they necessarily do, because that info probably isn't available through static (compile-time) analysis. The compiler's message seems to indicate that, too: "can't determine dependence between _348->val_ and *_355".

@andrjohns
Collaborator Author

That's the most verbose of the flags; there are a bunch of others for specific optimizations as well: https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization

@t4c1
Contributor

t4c1 commented Apr 14, 2020

@andrjohns I got your NullaryExpr example working here. I also removed the redundant .derived() calls and added a single-argument operator() to the functor (Eigen will use that one if linear indexing is possible).

Since you already have a benchmark, I would ask you to also benchmark this implementation. It might be a bit faster, since in most cases no index arithmetic is required.
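For readers without the godbolt link handy, the functor-based version would look roughly like this (a sketch of the approach described above, not the exact code behind the link):

// Sketch of a NullaryExpr functor that returns the value of each fvar
// coefficient while writing its derivative to a second matrix as a side
// effect. Eigen calls the two-argument operator() for 2D indexing and the
// one-argument overload when linear indexing is possible.
template <typename EigFvar, typename EigOut>
class read_fvar_functor {
  const EigFvar& var_mat_;
  EigOut& deriv_mat_;

 public:
  read_fvar_functor(const EigFvar& var_mat, EigOut& deriv_mat)
      : var_mat_(var_mat), deriv_mat_(deriv_mat) {}

  // 2D indexing
  auto operator()(Eigen::Index i, Eigen::Index j) const {
    deriv_mat_.coeffRef(i, j) = var_mat_.coeff(i, j).d_;
    return var_mat_.coeff(i, j).val_;
  }

  // linear indexing, used by Eigen when the expressions allow it
  auto operator()(Eigen::Index i) const {
    deriv_mat_.coeffRef(i) = var_mat_.coeff(i).d_;
    return var_mat_.coeff(i).val_;
  }
};

// usage, assuming x is an Eigen matrix of fvar<double>:
//   vals = Eigen::MatrixXd::NullaryExpr(
//       x.rows(), x.cols(),
//       read_fvar_functor<decltype(x), Eigen::MatrixXd>(x, derivs));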

@andrjohns
Collaborator Author

Nice! I hadn't thought of getting it to evaluate that way. I added it to the benchmark and there is a slight performance edge, but it's pretty small. However, the NullaryExpr version would likely generalise to expressions better than mine.

2020-04-14 16:15:30
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.72, 0.67, 0.63
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
Matrix_EigMethod          3786747 ns      3786828 ns          178
Matrix_EigMethod_New      2529627 ns      2529648 ns          276
Matrix_EigMethod_Tadej    2466528 ns      2466532 ns          279
Matrix_LoopRMajor        18095283 ns     18094845 ns           38
Matrix_LoopCMajor         2496594 ns      2496625 ns          278

RowVec_EigMethod           168161 ns       168159 ns         4176
RowVec_EigMethod_New       136515 ns       136516 ns         5105
RowVec_EigMethod_Tadej     135394 ns       135397 ns         3931
RowVec_Loop                136652 ns       136653 ns         5057

ColVec_EigMethod           175110 ns       175113 ns         4000
ColVec_EigMethod_New       139783 ns       139783 ns         5099
ColVec_EigMethod_Tadej     138213 ns       138213 ns         4978
ColVec_Loop                132974 ns       132975 ns         5121

@t4c1
Contributor

t4c1 commented Apr 14, 2020

Nice! This seems ready for a PR. Will you do it, or should I?

@t4c1
Contributor

t4c1 commented Apr 14, 2020

Actually, looking at the results again, the differences between your implementation and mine seem almost negligible (and a bit random). I still prefer NullaryExpr.

@andrjohns
Collaborator Author

Yeah, I'm in favour of NullaryExpr as well, since there's also the possibility of future benefits if the Eigen team further optimises nullary expressions. I'll take care of the PR; the only thing to clarify is the number of member functions.

For matrices of var, headers generally need various combinations of val/adj/vari, rather than all three. So we could have member functions like the following:

mat.read_vari_val_adj(out1, out2, out3)
mat.read_vari_val(out1, out2)
mat.read_vari_adj(out1, out3)
mat.read_val_adj(out2, out3)

Or is there a better way to handle this?

@t4c1
Contributor

t4c1 commented Apr 14, 2020

I don't know of any case where we need vari. That would only mean mat.read_val_adj(out2, out3). But if you know of any, go ahead and implement all of those.

@andrjohns
Collaborator Author

It gets used a bit in the rev matrix functions (see the multiply header where the .vi() and .val() are used separately).

Which makes me think that the naming should be read_vi_*, for consistency with the other member functions.
