
Efficient value_of and value_of_rec for matrices #250

Merged
3 commits merged into develop from feature/issue-249-efficient-mat-value_of on Mar 28, 2016

Conversation

@randommm
Member
commented Mar 3, 2016

Summary:

Fix issue #249

Intended Effect:

Write value_of and value_of_rec for matrices in a more efficient way.

value_of and value_of_rec for matrices are currently written in an inefficient way that calls mat(i, j), mat.rows(), and mat.cols() on each iteration of the loop.

The sizes should instead be stored in a temporary int and the matrix data pointer in a temporary T*.
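For reference, a sketch of the kind of specialization this describes (the namespaces and signature are assumed from Stan Math conventions, and the scalar base case is a stub so the sketch stands alone; the loop body mirrors the diff quoted later in this thread):

    #include <Eigen/Dense>

    namespace stan {
    namespace math {

    // Stub scalar base case so the sketch compiles on its own; the library
    // itself provides overloads for var, fvar, and so on.
    inline double value_of_rec(double x) { return x; }

    template <typename T, int R, int C>
    inline Eigen::Matrix<double, R, C>
    value_of_rec(const Eigen::Matrix<T, R, C>& M) {
      Eigen::Matrix<double, R, C> result(M.rows(), M.cols());
      double* data_r = result.data();    // destination storage
      const T* data_m = M.data();        // source storage
      int S = M.size();                  // size hoisted into a temporary
      for (int i = 0; i < S; i++)
        data_r[i] = value_of_rec(data_m[i]);  // one scalar conversion per element
      return result;
    }

    }  // namespace math
    }  // namespace stan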

How to Verify:

Code review.

Side Effects:

None.

Documentation:

Not applicable.

Reviewer Suggestions:

@rtrangucci

Review comment thread on stan/math/prim/mat/fun/value_of_rec.hpp, on the new loop:

      for (int i = 0; i < S; i++)
        data_r[i] = value_of_rec(data_m[i]);
      return result;
    }
@bob-carpenter
Contributor

I understand not doing the double indexing --- that's slow. But I think the bit after the declaration/construction of result, we should just do the following unless you can demonstrate that what you did with all the temporaries is faster.

for (int i = 0; i < M.size(); ++i)
  result(i) = value_of(M(i));

We know the value can't be recursive through Eigen in any of our code, so we don't need value_of_rec.

@rtrangucci
Contributor

@bob-carpenter I'm confused about the comment above about value_of_rec.

I can think of a situation where we would have:

Eigen::Matrix<fvar<var>, R, C> x

and want Eigen::Matrix<double, R, C> y where y(i, j) = x(i, j).val_.val()
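A toy sketch of the recursion being described (stand-in types only, not Stan Math's actual var and fvar, which are far more involved):

    #include <iostream>

    struct var { double v_; double val() const { return v_; } };
    template <typename T> struct fvar { T val_; T d_; };

    inline double value_of_rec(double x) { return x; }
    inline double value_of_rec(const var& v) { return v.val(); }
    template <typename T>
    inline double value_of_rec(const fvar<T>& x) { return value_of_rec(x.val_); }

    int main() {
      fvar<var> x{var{1.5}, var{0.0}};
      std::cout << value_of_rec(x) << "\n";  // strips both levels: x.val_.val() == 1.5
    }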

@bob-carpenter
Contributor

Ah, I see --- the recursion isn't just through container structures (std::vector), but also down through mixed types. Thanks for pointing this out.

On Mar 3, 2016, at 3:43 PM, Rob Trangucci notifications@github.com wrote:

> In stan/math/prim/mat/fun/value_of_rec.hpp:
>
>       Eigen::Matrix<double, R, C> Md(M.rows(), M.cols());
>       for (int j = 0; j < M.cols(); j++)
>         for (int i = 0; i < M.rows(); i++)
>           Md(i, j) = value_of_rec(M(i, j));
>       return Md;
>
>       Eigen::Matrix<double, R, C>
>         result(M.rows(), M.cols());
>       double * data_r = result.data();
>       const T * data_m = M.data();
>       int S = M.size();
>       for (int i = 0; i < S; i++)
>         data_r[i] = value_of_rec(data_m[i]);
>       return result;
>     }
>
> @bob-carpenter I'm confused about the comment above about value_of_rec.
>
> I can think of a situation where we would have:
>
>     Eigen::Matrix<fvar<var>, R, C> x
>
> and want Eigen::Matrix<double, R, C> y where y(i, j) = x(i, j).val_.val()

@randommm
Member Author

> I understand not doing the double indexing --- that's slow. But I think the bit after the declaration/construction of result, we should just do the following unless you can demonstrate that what you did with all the temporaries is faster.
>
>     for (int i = 0; i < M.size(); ++i)
>       result(i) = value_of(M(i));

The following code https://gist.github.com/randommm/4e61fabe8c5438daeebf

runs a Kahan sum using four different methods (warning: it requires more than 3 GB of memory to run):

(1) column-major double loop:

    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
      for (size_t j = 0; j < nRows; ++j)
        // do something with matrix(j, i)

(2) row-major double loop:

    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
      for (size_t j = 0; j < nCols; ++j)
        // do something with matrix(i, j)

(3) linear indexing:

    for (size_t i = 0, size = xs.size(); i < size; i++)
      // do something with matrix(i)

(4) raw data pointer:

    T * matrix_data = matrix.data();
    for (size_t i = 0, size = xs.size(); i < size; i++)
      // do something with matrix_data[i]
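For reference, a self-contained sketch of that comparison (this is not the gist itself; the matrix size, the Kahan-sum details, and the omission of method (2), which only swaps the loop order of (1), are choices made here):

    #include <ctime>
    #include <iostream>
    #include <Eigen/Dense>

    // (1) Kahan-compensated sum over mat(i, j), traversed column by column.
    double sum_colmajor(const Eigen::MatrixXd& xs) {
      double sum = 0, c = 0;
      for (int j = 0; j < xs.cols(); ++j)
        for (int i = 0; i < xs.rows(); ++i) {
          double y = xs(i, j) - c, t = sum + y;
          c = (t - sum) - y;
          sum = t;
        }
      return sum;
    }

    // (3) Same sum via single-index access mat(i).
    double sum_linear(const Eigen::MatrixXd& xs) {
      double sum = 0, c = 0;
      for (long i = 0; i < xs.size(); ++i) {
        double y = xs(i) - c, t = sum + y;
        c = (t - sum) - y;
        sum = t;
      }
      return sum;
    }

    // (4) Same sum via the raw storage pointer.
    double sum_data(const Eigen::MatrixXd& xs) {
      const double* d = xs.data();
      double sum = 0, c = 0;
      for (long i = 0, S = xs.size(); i < S; ++i) {
        double y = d[i] - c, t = sum + y;
        c = (t - sum) - y;
        sum = t;
      }
      return sum;
    }

    int main() {
      Eigen::MatrixXd test = Eigen::MatrixXd::Random(20000, 10000);  // about 1.6 GB
      std::time_t now;
      now = std::time(0);
      std::cout << sum_colmajor(test) << " " << std::time(0) - now << "s\n";
      now = std::time(0);
      std::cout << sum_linear(test) << " " << std::time(0) - now << "s\n";
      now = std::time(0);
      std::cout << sum_data(test) << " " << std::time(0) - now << "s\n";
    }

Printing each sum keeps the optimizer from discarding the work, which matters once -O3 is turned on (see below).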

Here's the output of how much time each of them took (methods (1)-(4) in order, in seconds):

    $ g++ main.cpp `pkg-config eigen3 --cflags --libs`
    $ ./a.out
    start
    25
    49
    25
    7

(The first build is without -DEIGEN_NO_DEBUG, so the mat(i, j) and mat(i) accesses still carry Eigen's range-check assertions.)

    $ g++ main.cpp -DEIGEN_NO_DEBUG `pkg-config eigen3 --cflags --libs`
    $ ./a.out
    start
    13
    35
    12
    7

@randommm
Member Author

Also, with -O3, this:

    for (size_t i = 0; i < xs.rows(); ++i)
      for (size_t j = 0; j < xs.cols(); ++j)

doesn't seem to be slower than this:

    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
      for (size_t j = 0; j < nCols; ++j)

even when xs is not const.

(Sorry if it sounds rather obvious that the compiler's able to optimize this.)

@bob-carpenter
Contributor

Well, maybe write out the time taken in more precision or run more loops in the micro-benchmark. It's hard to see what's going on when they all print out zero!

Eigen (and Stan) leans very very heavily on having a good optimizing compiler to inline function calls statically.

Contributor

This is a bit off topic for this issue, but it's never obvious what a C++ compiler's going to do!

@randommm
Member Author

On 03/03/2016 20:35, Bob Carpenter wrote:

> Well, maybe write out the time taken in more precision

Do you know how to do this with std::cout and std::time?

    cout << time(0) - now << endl;

(I don't think it'll be necessary though; see below.)
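A sketch of one way to get sub-second timings with std::chrono (not what the gist uses):

    #include <chrono>
    #include <cmath>
    #include <iostream>

    int main() {
      auto start = std::chrono::steady_clock::now();
      double s = 0;
      for (int i = 1; i <= 10000000; ++i) s += std::sqrt(i);  // stand-in workload
      auto stop = std::chrono::steady_clock::now();
      std::cout << s << " computed in "
                << std::chrono::duration<double>(stop - start).count() << " s\n";
    }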

> or run more loops in the micro-benchmark.

If I try to double the number of elements in the matrix, I get std::bad_alloc.

But anyway, the 0's were coming up because the optimizer realized that I wasn't using the output of the function at all, so it wasn't even running it!

So I changed:

 now = time(0);
 sum_kahan1t(test);
 cout << time(0) - now << endl;

to:

 now = time(0);
 cout << sum_kahan1t(test) << endl;
 cout << time(0) - now << endl;

And after running this a lot of times, I don't think there's any
difference between them, and even if there is, it's probably negligible
(iff -DEIGEN_NO_DEBUG and -O3 are used together).

$ g++ main.cpp -DEIGEN_NO_DEBUG `pkg-config eigen3 --cflags --libs` -O3
$ ./a.out
start
7697.42
3
7697.42
9
7697.42
4
7697.42
3

$ ./a.out
start
7697.42
3
7697.42
9
7697.42
3
7697.42
4


Each method run 10 times and summed up:

 double te = 0;

 now = time(0);
 for (int i = 0; i < 10; i++)
   te += sum_kahan1t(test);
 cout << time(0) - now << endl;

 now = time(0);
 for (int i = 0; i < 10; i++)
   te += sum_kahan2t(test);
 cout << time(0) - now << endl;

 now = time(0);
 for (int i = 0; i < 10; i++)
   te += sum_kahan3t(test);
 cout << time(0) - now << endl;

 now = time(0);
 for (int i = 0; i < 10; i++)
   te += sum_kahan4t(test);
 cout << time(0) - now << endl;

 cout << te << endl;

$ g++ main.cpp -DEIGEN_NO_DEBUG `pkg-config eigen3 --cflags --libs` -O3
$ ./a.out
start
33
86
32
33
307897

Contributor

Thanks. That's what we wanted to see. The only thing I wouldn't have expected was the double-indexing to be just as fast as single indexing, but that must just be getting parallelized or at least not taking up measurable amounts of time compared to the floating point operations.

@rtrangucci
Contributor

@randommm based on all the results above, shall we close this PR?

@bob-carpenter
Contributor

We still want the direct return on the double instantiations.
Don't know why that wasn't there to begin with.

-- Bob
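The "direct return" presumably amounts to a specialization along these lines (a sketch of the assumed form, not necessarily the committed code):

    #include <Eigen/Dense>

    // When the scalar type is already double, hand the matrix straight back
    // instead of copying element by element. Returning a const reference
    // avoids the copy, but the caller must keep the argument alive.
    template <int R, int C>
    inline const Eigen::Matrix<double, R, C>&
    value_of_rec(const Eigen::Matrix<double, R, C>& M) {
      return M;
    }

    int main() {
      Eigen::MatrixXd m = Eigen::MatrixXd::Random(2, 2);
      const Eigen::MatrixXd& v = value_of_rec(m);  // no per-element loop
      (void) v;
    }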


@bob-carpenter
Contributor

This is all ready to go. @syclik, is this OK to merge? (I'm a bit shy to merge things given all the ramifications upstream on Stan.)

@syclik merged commit 2967b52 into develop on Mar 28, 2016
@syclik deleted the feature/issue-249-efficient-mat-value_of branch on Mar 28, 2016 at 15:57
@syclik modified the milestone: v2.11.0 on Jul 27, 2016