[Medium] Fast solvers for situations with many columns. #133

Open

tbenthompson opened this issue May 20, 2020 · 7 comments
Labels
on hold (not now, maybe never) · performance (speed, memory, or accuracy)

Comments

@tbenthompson (Collaborator)

Currently, the solvers all build a Fisher matrix/Hessian. If the data matrix is n x m, this Fisher matrix will have shape m x m. If m is less than a few thousand, this is fine. However, if m gets very large, possibly due to lots of categorical variables, the current solvers will be inadequate because they are constructing a very large dense matrix.
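For concreteness, here is a minimal sketch (illustrative only, not this package's actual code) of the object in question: for a GLM fit by IRLS, the Fisher matrix is X^T diag(w) X, which is dense and m x m regardless of how sparse X is.

```python
import numpy as np

def fisher_matrix(X, w):
    """Dense Fisher matrix X^T diag(w) X; shape (m, m)."""
    return X.T @ (w[:, None] * X)

n, m = 1_000, 200               # small enough to actually run
X = np.random.rand(n, m)        # data matrix (stand-in for real features)
w = np.random.rand(n)           # per-row IRLS working weights

F = fisher_matrix(X, w)         # (200, 200) here, but at m = 50,000 this
                                # dense matrix alone would need
                                # 50,000^2 * 8 bytes = 20 GB
```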

We can currently avoid that matrix construction with the diag_fisher=True option. However, it is not particularly well supported and its performance may be poor. We should consider adding a benchmark with more than 10,000 columns to better track performance in these cases.

As part of this effort, it would be nice to separate out the diag_fisher behavior into its own solver function for high-dimensional problems.

@ElizabethSantorellaQC (Contributor) commented May 20, 2020

Out of the solvers we currently have available, I think the only ones that don't construct the full Hessian would be L-BFGS and CD with diag_fisher (but let me know if I'm wrong). So I think diag_fisher is valuable as it's the only way we can reliably use both an L1 penalty and wide data.

I don't fully understand the value of the off-diagonal elements of the Hessian for CD when it's only updating one variable at a time. Why does performance degrade if we only use the diagonal?

@tbenthompson (Collaborator, Author)

> I don't fully understand the value of the off-diagonal elements of the Hessian for CD when it's only updating one variable at a time. Why does performance degrade if we only use the diagonal?

The running gradient is updated at each step of CD using the Hessian column for the coordinate that just changed. So, if you only have the diagonal, the gradient update is incorrect for every variable besides the current one.
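A schematic sketch of that mechanism (illustrative, not the actual CD implementation here): after each single-coordinate step, the running gradient is corrected with the j-th Hessian column, which is exactly the piece diag_fisher throws away.

```python
import numpy as np

def cd_sweep(H, grad, beta):
    """One sweep of coordinate descent on a quadratic model
    0.5 * beta^T H beta + g^T beta, keeping the running gradient
    current after every single-coordinate update."""
    for j in range(len(beta)):
        delta = -grad[j] / H[j, j]   # exact minimizer in coordinate j
        beta[j] += delta
        grad += H[:, j] * delta      # needs the FULL j-th Hessian column;
                                     # with only diag(H), grad[k] for k != j
                                     # goes stale and later steps are wrong
    return beta, grad
```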

@ElizabethSantorellaQC (Contributor)

Notes from meeting:

  • ADMM currently does not seem to help here since it requires solving a linear system at every step, which makes each step slower. Perhaps we could speed it up with an iterative sparse solver (see the sketch after this list).
  • The diag_fisher strategy might be helpful -- perhaps the implementation just had a bug.
  • h2o has at least 3 solvers available -- we should benchmark each of these.
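As a sketch of the iterative-solver idea in the first bullet (the splitting and names here are assumptions, not this package's API): the ADMM x-update for an L1-penalized quadratic solves (X^T X + rho I) x = rhs, which scipy's conjugate gradient can do matrix-free, touching only X @ v and X.T @ u products rather than ever forming the m x m matrix.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def admm_x_update(X, rhs, rho, x0=None):
    """Solve (X^T X + rho * I) x = rhs iteratively. Each CG matvec
    costs two passes over X (sparse-friendly) instead of O(m^2)."""
    n, m = X.shape
    matvec = lambda v: X.T @ (X @ v) + rho * v
    A = LinearOperator((m, m), matvec=matvec)
    x, info = cg(A, rhs, x0=x0)   # info == 0 means CG converged
    return x
```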

@ElizabethSantorellaQC added the performance (speed, memory, or accuracy) label on May 26, 2020
@tbenthompson (Collaborator, Author)

Some potential steps:

  • Add a benchmark problem with 5,000 columns -- solvable but slow with current methods.
  • Reimplement the diag_fisher stuff.
  • Add a benchmark problem with 50,000 columns -- not solvable with current methods.

@tbenthompson (Collaborator, Author)

An interesting idea here would be to implement an L-BFGS + CD solver. At each iteration, the new gradient information would be used to build a low-rank approximation to the Hessian. That low-rank approximation could then be used inside the CD solver in place of either the full data matrix or the true Hessian.
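One possible reading of this, as a sketch: store the L-BFGS curvature pairs in the compact (Byrd-Nocedal-Schnabel) form B = sigma * I + U @ M @ U.T, so that CD can pull out single Hessian columns in O(m * k) time instead of O(m^2). The names U and M below are generic placeholders for the usual compact-representation factors, not anything in this codebase.

```python
import numpy as np

def lbfgs_hessian_column(j, sigma, U, M):
    """Column j of the compact L-BFGS Hessian approximation
    B = sigma * I + U @ M @ U.T, where U is (m, 2k) and M is (2k, 2k)
    for k stored curvature pairs. Cost is O(m * k), so the CD gradient
    update stays cheap even when m is huge."""
    col = U @ (M @ U[j])   # low-rank part: row j of U through the core
    col[j] += sigma        # diagonal (scaled-identity) part
    return col
```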

@tbenthompson (Collaborator, Author)

Some good sources on "Hessian-free" second-order optimization methods:
https://andrew.gibiansky.com/blog/machine-learning/gauss-newton-matrix/
http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf

@benjaminfspector

@tbenthompson and I were just talking about another alternative which could work. Essentially, the idea would be to add a third, outermost loop that samples a tractable column minibatch (say, 500 columns) and then runs the solver for several iterations on just those columns. This method might pair well with @tbenthompson's new approximate Hessian code, since one could be quite aggressive about enforcing sparsity in the delta-D vector, making subsequent iterations of the middle loop very inexpensive thanks to the now-cheap Hessian update. It also has the advantage of being a relatively easy modification to the existing solver, so it would be simple to try and to benchmark future methods against.
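A minimal sketch of the proposed three-loop structure (block_cd and inner_solver are hypothetical names; it assumes the existing solver can take the frozen columns' contribution as an offset):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_cd(X, y, beta, inner_solver, block_size=500, n_outer=20):
    """Hypothetical outermost loop: repeatedly sample a tractable block
    of columns and run the existing solver on that sub-problem, holding
    all other coefficients fixed."""
    n, m = X.shape
    for _ in range(n_outer):
        cols = rng.choice(m, size=block_size, replace=False)
        frozen = np.setdiff1d(np.arange(m), cols)
        # the frozen columns' contribution enters as a fixed offset
        offset = X[:, frozen] @ beta[frozen]
        beta[cols] = inner_solver(X[:, cols], y, offset, beta[cols])
    return beta
```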
