Some updates #154

Merged (2 commits into SciML:master, Apr 22, 2020)

Conversation

Vaibhavdixit02
Member

No description provided.

@Vaibhavdixit02
Member Author

@ChrisRackauckas I can't recreate the error on my end, can you try it out?

@ChrisRackauckas
Member

I can't figure out how to get Stan installed on Windows 🤷

@ChrisRackauckas
Member

Maybe @goedman could help?

@ChrisRackauckas
Member

StanJulia/CmdStan.jl#92 might be helpful for getting this solved.

@ChrisRackauckas
Member

We might as well merge and tag, and pick this up from there.

@Vaibhavdixit02 merged commit 8359aaa into SciML:master on Apr 22, 2020
@goedman

goedman commented Apr 22, 2020 via email

@goedman

goedman commented Apr 22, 2020

On my system it runs most of the time. In 15 runs I saw two kinds of problems. Twice I got:

Unrecoverable error evaluating the log probability at the initial value.
Exception: Max number of iterations exceeded (100000). (in '/Users/rob/tmp/parameter_estimation_model.stan', line 22, column 2 to column 27)
Exception: Max number of iterations exceeded (100000). (in '/Users/rob/tmp/parameter_estimation_model.stan', line 22, column 2 to column 27)

and several times:

Test Failed at /Users/rob/.julia/dev/CmdStan/test/DiffEqBayes/deb_stan.jl:34
  Expression: ≈((sdf[sdf.parameters .== :theta1, :mean])[1], 1.0, atol = 0.3)
   Evaluated: 9.66635e-7 ≈ 1.0 (atol=0.3)
ERROR: LoadError: There was an error during testing
in expression starting at /Users/rob/.julia/dev/CmdStan/test/DiffEqBayes/deb_stan.jl:34

@goedman

goedman commented Apr 22, 2020

The second problem always occurs after many informational messages.

@Vaibhavdixit02
Member Author

I get the informational messages, but the test always passes for me.

@ChrisRackauckas
Member

We can make the priors have smaller ranges for Stan to help it pass. I still don't know what the return code -5 issue is on CI though.
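
For concreteness, a minimal sketch of what a tighter prior could look like on the DiffEqBayes side (Julia, using Distributions.jl; the exact bounds used in the actual tests may differ):

using Distributions

# Narrower, truncated prior for the single estimated parameter (illustrative values only)
priors = [truncated(Normal(1.5, 0.1), 0.5, 2.0)]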

@goedman

goedman commented Apr 22, 2020

Below is the output of the run (sampling) log when the return code is -5. It looks like the error occurs in the call to integrate_ode_rk45(sho, ...):

method = sample (Default)
  sample
    num_samples = 300
    num_warmup = 500
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 1
id = 1
data
  file = parameter_estimation_model_1.data.R
init = 2 (Default)
random
  seed = -1 (Default)
output
  file = parameter_estimation_model_samples_1.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)

Unrecoverable error evaluating the log probability at the initial value.
Exception: Max number of iterations exceeded (100000). (in '/Users/rob/.julia/dev/CmdStan/test/DiffEqBayes/tmp/parameter_estimation_model.stan', line 22, column 2 to column 27)
Exception: Max number of iterations exceeded (100000). (in '/Users/rob/.julia/dev/CmdStan/test/DiffEqBayes/tmp/parameter_estimation_model.stan', line 22, column 2 to column 27)

Line 22 is the return line in sho():

real[] sho(real t, real[] internal_var___u, real[] internal_var___p, real[] x_r, int[] x_i) {
  real internal_var___du[2];
  internal_var___du[1] = internal_var___p[1] * internal_var___u[1] - internal_var___u[1] * internal_var___u[2];
  internal_var___du[2] = internal_var___u[2] * -3 + internal_var___u[1] * internal_var___u[2];
  return internal_var___du;
}

@ChrisRackauckas
Member

Oh, it looks like the Stan ODE solver is diverging. It makes sense that's where the issue is, thanks. Yeah we can just give it more restrictive priors (#155) and it should be fine.

@goedman

goedman commented Apr 23, 2020

Just looking at the very first simulation (the others show similar behavior), the posterior for the model is pretty much identical to the prior used, e.g. for theta1; the 5%, 50%, and 95% regions are also worrisome:

Inference for Stan model: parameter_estimation_model_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (0.54, 0.46, 0.53, 0.56) seconds, 2.1 seconds total
Sampling took (0.39, 0.45, 0.46, 0.48) seconds, 1.8 seconds total

                Mean     MCSE   StdDev    5%   50%   95%    N_Eff  N_Eff/s    R_hat
lp__             9.7  3.1e-02  1.2e+00   7.3    10    11  1.6e+03  9.0e+02  1.0e+00
accept_stat__   0.90  1.5e-02  1.4e-01  0.63  0.95   1.0  8.1e+01  4.5e+01  1.0e+00
stepsize__      0.72  5.3e-02  7.4e-02  0.60  0.77  0.78  2.0e+00  1.1e+00  2.1e+13
treedepth__      2.3  8.3e-02  5.6e-01   2.0   2.0   3.0  4.5e+01  2.5e+01  1.0e+00
n_leapfrog__     5.0  2.5e-01  2.1e+00   3.0   7.0   7.0  6.8e+01  3.8e+01  1.0e+00
divergent__     0.00      nan  0.0e+00  0.00  0.00  0.00      nan      nan      nan
energy__        -8.3  4.4e-02  1.7e+00   -10  -8.6  -5.0  1.5e+03  8.2e+02  1.0e+00
sigma1[1]       0.27  1.9e-03  8.3e-02  0.17  0.25  0.42  2.0e+03  1.1e+03  1.0e+00
sigma1[2]       0.25  1.4e-03  7.3e-02  0.16  0.24  0.39  2.9e+03  1.6e+03  1.0e+00
theta1           1.5  1.1e-04  5.6e-03   1.5   1.5   1.5  2.5e+03  1.4e+03  1.0e+00
theta[1]         1.5  1.1e-04  5.6e-03   1.5   1.5   1.5  2.5e+03  1.4e+03  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

Is that what you expect? This run used a truncated(Normal(1.5, 0.5), 0, 2) prior.

![diffeqbayes_1_stan](https://user-images.githubusercontent.com/52200/80136644-03837680-8570-11ea-95e7-ef34368645b2.png)

With a truncated(Normal(0.5, 0.5), 0, 2) prior the results vary wildly; e.g. the run below shows one chain at ~0.5 and the others at ~1.5:

[figure: diffeqbayes_1b_stan]

Anyway, there are probably good reasons to do it this way for this model.

@Vaibhavdixit02
Member Author

@goedman the only reason to have the priors be so restrictive is to get the tests to pass 🙂

Do you have any idea if we are doing something wrong on our end or is there something going on at CmdStan's end?

I also had some issues with the robustness of the Stan backend while preparing the benchmarks (https://github.com/SciML/DiffEqBenchmarks.jl/blob/master/markdown/ParameterEstimation/DiffEqBayesLotkaVolterra.md#stanjl-backend), which was a little surprising to me because we have used the same problem in the past and Stan used to perform considerably better than the others.

@Vaibhavdixit02
Member Author

Also, LV is a pretty standard problem; Stan has its own example for it: https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html

@goedman

goedman commented Apr 24, 2020

Overall:

Using more data points, e.g. 30 vs. 10, seems to help the conditioning for the LV benchmark example.

Details:

Playing around with the 4-parameter LV example from the benchmarks, I get much better results:

17×10 DataFrame. Omitted printing of 3 columns
│ Row │ parameters    │ mean      │ mcse       │ std       │ 5%        │ 50%      │ 95%      │
│     │ Symbol        │ Float64   │ Float64    │ Float64   │ Float64   │ Float64  │ Float64  │
├─────┼───────────────┼───────────┼────────────┼───────────┼───────────┼──────────┼──────────┤
│ 1   │ lp__          │ 8.59434   │ 0.0667253  │ 1.7976    │ 5.07189   │ 8.96115  │ 10.8216  │
│ 2   │ accept_stat__ │ 0.924566  │ 0.00213286 │ 0.10257   │ 0.711828  │ 0.965008 │ 0.999445 │
│ 3   │ stepsize__    │ 0.0921111 │ 0.00699752 │ 0.0099146 │ 0.0782337 │ 0.099047 │ 0.103591 │
│ 4   │ treedepth__   │ 4.76406   │ 0.0176853  │ 0.754038  │ 3.0       │ 5.0      │ 6.0      │
│ 5   │ n_leapfrog__  │ 38.6944   │ 2.21735    │ 18.5644   │ 15.0      │ 31.0     │ 63.0     │
│ 6   │ divergent__   │ 0.0       │ NaN        │ 0.0       │ 0.0       │ 0.0      │ 0.0      │
│ 7   │ energy__      │ -5.61333  │ 0.089947   │ 2.45267   │ -8.98815  │ -5.95624 │ -1.06611 │
│ 8   │ sigma1[1]     │ 0.549852  │ 0.00174832 │ 0.07488   │ 0.444792  │ 0.540745 │ 0.685217 │
│ 9   │ sigma1[2]     │ 0.406046  │ 0.00133368 │ 0.0563632 │ 0.325087  │ 0.400113 │ 0.504813 │
│ 10  │ theta1        │ 1.53285   │ 0.00202482 │ 0.054143  │ 1.44579   │ 1.53158  │ 1.621    │
│ 11  │ theta2        │ 1.1016    │ 0.00150346 │ 0.0497223 │ 1.02392   │ 1.10081  │ 1.18557  │
│ 12  │ theta3        │ 2.94679   │ 0.00559053 │ 0.148482  │ 2.7121    │ 2.94339  │ 3.19765  │
│ 13  │ theta4        │ 0.966377  │ 0.00211352 │ 0.0552594 │ 0.880345  │ 0.963816 │ 1.06272  │
│ 14  │ theta[1]      │ 1.53285   │ 0.00202482 │ 0.054143  │ 1.44579   │ 1.53158  │ 1.621    │
│ 15  │ theta[2]      │ 1.1016    │ 0.00150346 │ 0.0497223 │ 1.02392   │ 1.10081  │ 1.18557  │
│ 16  │ theta[3]      │ 2.94679   │ 0.00559053 │ 0.148482  │ 2.7121    │ 2.94339  │ 3.19765  │
│ 17  │ theta[4]      │ 0.966377  │ 0.00211352 │ 0.0552594 │ 0.880345  │ 0.963816 │ 1.06272  │

DynamicHMC results

1×6 Array{Float64,2}:
 1.53448  1.10125  2.9415  0.963586  0.530359  0.379295
1×6 Array{Float64,2}:
 0.0503889  0.0467975  0.13674  0.0513064  0.0760374  0.0528789

The DynamicHMC result arrays are the means and standard deviations of the 4 thetas and 2 sigmas. The theta means are identical between Stan and DynamicHMC. The sigma means are slightly offset (see the figures below).

Problems with DynamicHMC with 10 data points disappeared after changing it to 30 (this also made Stan behave better).
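
A sketch of that data-generation change, reusing the variable names from the benchmark script quoted later in this thread (sol is the Lotka-Volterra ODE solution; only the number of observation times changes):

using RecursiveArrayTools  # for VectorOfArray

t = collect(range(1, stop=10, length=30))   # 30 observation times instead of 10
randomized = VectorOfArray([sol(t[i]) + 0.5randn(2) for i in 1:length(t)])
data = convert(Array, randomized)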

I ran Stan with 4 chains of 800 samples and DynamicHMC as a single run of 3200 samples.

On my local machine I updated stan_inference() to accept a Stanmodel. I know Chris didn't feel that was significant a couple of years ago, but with many test runs it basically saves a lot of time for this example.

The script on Julia 1.4.1 with cmdstan v2.22.1 (it assumes the slightly updated stan_inference!) completes in ~25 sec, or ~15 sec without compilation, on my machine, comparable to DynamicHMC.

[figures: posterior plots for theta_1, theta_2, theta_3, theta_4]

@ChrisRackauckas
Member

On my local machine I updated stan_inference() to accept a Stanmodel. I know Chris didn't feel that was significant a couple of years ago, but with many test runs it basically saves a lot of time for this example.

Interesting. Is there anything to upstream here?

@goedman

goedman commented Apr 27, 2020

The ability to pass in a previously compiled Stanmodel is a very minor change. In my version I also shortened the model name from parameter_estimation_model to pem, as the original very long name made the file names (in the tmp subdirectory) bothersome in an editor sidebar.

As suggested by @Vaibhavdixit02, I also went back to the original LV example in Stan and used both DiffEqBayes and cmdstan to get a feeling for how the results compare. It turns out the thetas are very comparable, e.g.:

│ Row │ parameters │ mean      │ std        │ 5%        │ 95%       │ ess     │ n_eff/s │
│     │ Symbol     │ Float64   │ Float64    │ Float64   │ Float64   │ Float64 │ Float64 │
├─────┼────────────┼───────────┼────────────┼───────────┼───────────┼─────────┼─────────┤
│ 1   │ theta[1]   │ 0.550493  │ 0.0252924  │ 0.509728  │ 0.59288   │ 1107.64 │ 72.4427 │
│ 2   │ theta[2]   │ 0.0282025 │ 0.00197473 │ 0.0250953 │ 0.0315593 │ 1353.74 │ 88.5383 │
│ 3   │ theta[3]   │ 0.834142  │ 0.0379224  │ 0.772015  │ 0.898699  │ 1088.27 │ 71.1761 │
│ 4   │ theta[4]   │ 0.0258438 │ 0.00165386 │ 0.0231962 │ 0.0287042 │ 1133.23 │ 74.1161 │
│ 5   │ z_init[1]  │ 30.5238   │ 1.23295    │ 28.6004   │ 32.6044   │ 2664.72 │ 174.28  │
│ 6   │ z_init[2]  │ 4.1111    │ 0.200904   │ 3.80512   │ 4.45643   │ 1864.03 │ 121.913 │
│ 7   │ sigma[1]   │ 0.115413  │ 0.0262904  │ 0.0801187 │ 0.163579  │ 2339.38 │ 153.002 │
│ 8   │ sigma[2]   │ 0.136745  │ 0.0295192  │ 0.0964866 │ 0.193365  │ 2135.63 │ 139.676 │

The sigma results vary quite a bit.

The 89% bounds of the coefficient samples passed to the integration are pretty wide in both cases:

[figure: fig_01]

and (the DiffEqBayes Stan language formulation):

[figure: fig_02]

The lynx-hare dataset deviates quite a bit from the DiffEqBayes solution (see the first 2 plots) but all results for the theta coefficients are still pretty close.

For now I have stored all 7 runs in subdirectories of CmdStan.jl, and the cmdstan versions use the StatisticalRethinking approach.

@ChrisRackauckas
Member

We just can't get this to pass on CI. Are we using the wrong version of CmdStan?

@goedman

goedman commented May 7, 2020

Hi Chris, StatisticalRethinking is not using CmdStan.jl but StanSample.jl (a refactoring of CmdStan.jl). What would you like to pass on CI? The above example or the ability to pass in a previously compiled Stan model? Or both?

If so, it might be easier if I make a PR matching the DiffEqBayes setup.

@ChrisRackauckas
Member

I am just hoping we can keep a working function. We can't seem to recreate those results of yours above: #155. Stan seems to diverge on every run. If you have any ideas for how to fix it, that would be great. The example from above, Lotka-Volterra, is what we use as our integration test, and Stan is the only backend that seems to consistently have issues with it. It doesn't seem to be stochastic either (it fails on 100% of runs), so I am at a bit of a loss as to what to do.

@goedman

goedman commented May 7, 2020

Thanks, will have a go.

@goedman

goedman commented May 8, 2020

I'm not sure what the problem is at turing_inference.jl line 47. As is, DiffEqBayes won't compile because @logpdf is undefined. To compile DiffEqBayes I have to disable that line.

The one-parameter cases without truncated(Normal(...)) fail regularly, usually pointing to line 22 in the Stan language model file, sho()'s return.

If I disable those cases, I cannot confirm your findings. Yes, occasionally Stan can't find a good starting value in one of the chains (as do Turing, DynamicHMC, and the original Stan example).

For now I have a forked DiffEqBayes in StanJulia/DiffEqBayes.jl with my results and also an extra test script (stan_local_test.jl) that shows the chains. It only tests Stan; I disabled the other MCMC methods.

@ChrisRackauckas
Member

I'm not sure what the problem is at turing_inference.jl line 47. As is, DiffEqBayes won't compile because @logpdf is undefined. To compile DiffEqBayes I have to disable that line.

Turing.jl had a breaking change.

@ChrisRackauckas
Member

The one-parameter cases without truncated(Normal(...)) fail regularly, usually pointing to line 22 in the Stan language model file, sho()'s return.

I see, we can make that truncated if it helps

@goedman

goedman commented May 8, 2020

That's the next item I wanted to look into. The four-parameter summary now looks ok, but the one-parameter results are weird. I would like to understand why 5% == 50% == 95% are all 1.5.
I expect it has to do with what the integrator returns

@goedman

goedman commented May 8, 2020

Hmm, u_hat does not seem to explain that. For the one-parameter case:

[figure: one_par_pred_prey]

and for the four-parameter case:

[figure: four_par_pred_prey]

@bbbales2

FWIW, Stan used to pass these tests pretty robustly for about 2 years IIRC with the priors we had before #155, so this may be a fairly recent regression.

If something stopped working we definitely wanna figure that out!

I'm trying to reproduce this in R and Stan so we can figure out what happened.

I'm looking at the diff here and trying to use the previous model code: https://github.com/SciML/DiffEqBayes.jl/pull/155/files

Are we estimating the initial conditions? If not, are we using the true initial conditions for the ODE or are we just guessing they're equal to the first measured value?

Do I have the right priors for the differential equation parameters in the model (code at the end)? They're centered around the true values with standard deviation 0.01?

And is inv_gamma(4, 1) the prior used for the sigmas here? And the actual value for the standard deviations is 0.01? I'm not quite recovering the correct sigmas when I do inference on this model, and I think it's because an inv_gamma(4, 1) distribution doesn't really put much prior mass around the standard deviations being 0.01:

[figure: prior]
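
For reference, the lack of prior mass near 0.01 is easy to check numerically (a quick sketch in Julia with Distributions.jl; the numbers are approximate):

using Distributions

cdf(InverseGamma(4, 1), 0.05)   # ≈ 3e-6, essentially no mass at or below 0.05
mean(InverseGamma(4, 1))        # = 1/3, far from a true σ of 0.01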

R code to generate data and run model:

library(deSolve)
library(cmdstanr)

a = 1.5
b = 1.0
c = 3.0
d = 1.0

f = function(t, y, parms) {
  list(c(a * y[1] - b * y[1] * y[2],
         -c * y[2] + d * y[1] * y[2]))
}

u0 = c(1.0, 1.0)

t = c(0.0, seq(1, 10, length = 10))
sig = 0.01

out = ode(u0, t, f)

model = cmdstan_model("lv2.stan")

N = length(t) - 1
y = out[, 2:3] + matrix(rnorm(length(t) * 2, mean = 0, sd = sig), ncol = 2)

plot(t, y[,1])
plot(t, y[,2])

fit = model$sample(data = list(N = N,
                               t0 = t[1],
                               ts = t[2:length(t)],
                               y = y[2:length(t),],
                               y_init = y[1,]),
                   cores = 4)

fit$summary()

Stan code:

functions {
  real[] dz_dt(real t,       // time
               real[] z,     // system state {prey, predator}
               real[] theta, // parameters
               real[] x_r,   // unused data
               int[] x_i) {
    real u = z[1];
    real v = z[2];
    
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    
    real du_dt = (alpha - beta * v) * u;
    real dv_dt = (-gamma + delta * u) * v;
    
    return { du_dt, dv_dt };
  }
}

data {
  int<lower = 0> N;
  real t0;
  real ts[N];
  real y_init[2];
  real y[N, 2];
}

parameters {
  real<lower = 0.0, upper = 2.0> alpha;
  real<lower = 0.0, upper = 1.5> beta;
  real<lower = 0.0, upper = 4.0> gamma;
  real<lower = 0.0, upper = 2.0> delta;
  //real z_init[2];
  real<lower = 0> sigma[2];
}

transformed parameters {
  real theta[4] = { alpha, beta, gamma, delta };
  real z[N, 2]
  = integrate_ode_rk45(dz_dt, y_init, t0, ts, theta,
                       rep_array(0.0, 0), rep_array(0, 0),
                       1e-5, 1e-3, 5e2);
}

model {
  alpha ~ normal(1.5, 0.01);
  beta ~ normal(1.0, 0.01);
  gamma ~ normal(3.0, 0.01);
  delta ~ normal(1.0, 0.01);
  
  sigma ~ inv_gamma(4, 1);
  //z_init ~ normal(0, 10.0);

  for (k in 1:2) {
    //y_init[k] ~ normal(z_init[k], sigma[k]);
    y[ , k] ~ normal(z[, k], sigma[k]);
  }
}

Right now things seem to run:

> fit$summary()
# A tibble: 31 x 10
   variable    mean  median      sd     mad      q5    q95  rhat ess_bulk ess_tail
   <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 lp__     38.9    39.2    1.83    1.64    35.4    41.2    1.00    1784.    2201.
 2 alpha     1.50    1.50   0.00378 0.00378  1.50    1.51   1.00    2133.    2500.
 3 beta      0.998   0.998  0.00753 0.00741  0.986   1.01   1.00    2835.    2583.
 4 gamma     3.00    3.00   0.00909 0.00910  2.98    3.01   1.00    2870.    2687.
 5 delta     1.00    1.00   0.00538 0.00525  0.993   1.01   1.00    2327.    2136.
 6 sigma[1]  0.0907  0.0862 0.0262  0.0238   0.0570  0.141  1.00    2965.    2678.
 7 sigma[2]  0.0846  0.0813 0.0233  0.0214   0.0534  0.129  1.00    3397.    2564.

@goedman

goedman commented May 14, 2020

Hi Ben, thanks for stepping in. The problem on CI has been resolved by making sure that the earlier part of the tests (the one-parameter cases) uses a truncated(Normal()) prior. You cover that in the parameters section.

The part I don't get is why it seemed to have worked in earlier versions with just Normal() priors.

Running the truncated model with sig = 0.5 returns sigma values close to 0.5. Also, with priors that are not centered on the values used in the generating equations, the model behaves quite well.

I did a quick test with Stan's optimize to find initial values but that failed. But I have in general not been very successful with Stan's optimize.

@bbbales2

The problem on CI has been resolved by making sure that the earlier part of the tests (the one-parameter cases) uses a truncated(Normal()) prior

Oh okay, so one problem is with the one-parameter model with non-truncated priors? What I see in red in the diff here is priors = [truncated(Normal(1.5,0.1),0,2)] (which I am interpreting as a truncated normal), so I might be looking at the wrong commit.

That commit also has changes to a bunch of other tests, which makes me think that it wasn't just the one-parameter model causing problems. Any single model that used to work but doesn't now is a start to hunting this down.

It's entirely conceivable that a model sensitive to initialization could start working by changing the parameter constraints (because of how random initialization works on the unconstrained space). ODEs can definitely be difficult in this way, especially if we're estimating initial conditions (I guess I've now talked about initial conditions in two settings, so that's confusing), but I don't think we've changed anything about how we do initialization recently (so back to the same head scratching).

Is there a particular commit of the test file I should look at to get the priors? I can have a look and try to write up the Stan model and make sure we're testing the same thing.

@Vaibhavdixit02
Member Author

Hi @bbbales2 let me clarify things for you

Are we estimating the initial conditions? If not, are we using the true initial conditions for the ODE or are we just guessing they're equal to the first measured value?

In the first test we are not estimating the initial conditions. I checked the commits, and it looks like we had switched this to a truncated normal quite a while back, but that was to increase the test coverage, so I am 90% sure it used to work without the truncated distribution. In the second test we are estimating the initial conditions as well, and this test failed without the truncation added in this commit.

They're centered around the true values with standard deviation 0.01?

Yes

And is inv_gamma(4, 1) the prior used for the sigmas here?

We are using inv_gamma(3,3)

I hope this clarifies the situation a bit?

@goedman

goedman commented May 14, 2020

Adding my $0.02:

The one-parameter test consists of 4 cases. When I started to look at the tests, the first case used a truncated prior and the next 3 just Normal() priors. Invariably, one of the 3 Normal() tests failed.

In my tests I also updated the lower truncation bounds to 0.1 (>0, which is what they used to be). Chris, in the above commit, has gone a step further, setting some of them to 0.5 or 1.

I think since Chris applied the above update, CI has been fine. Reverting line 26, 38, or 46 should immediately show the problem.

But I have not been able to pass CI with all of lines 26, 38, and 46 reverted, and I am at a loss as to how to show that at some point, say with DiffEqBayes v2.12.0, this once worked. DiffEqBayes still uses cmdstan v2.21.0 while I test against v2.23.0, but that's not the issue here.

Like you point out, Ben, with all formulations I have tested, including Bob Carpenter's original Lotka-Volterra example with real lynx-hare data, I very occasionally do see one chain diverge.

@bbbales2

In the model here:

priors = [truncated(Normal(1.,0.01),0.5,2.0),truncated(Normal(1.,0.01),0.5,2.0),truncated(Normal(1.5,0.01),1.0,2.0)]
bayesian_result = stan_inference(prob1,t,data,priors;num_samples=300,
                                 num_warmup=500,likelihood=Normal,sample_u0=true)
sdf = CmdStan.read_summary(bayesian_result.model)
@test sdf[sdf.parameters .== :theta1, :mean][1] ≈ 1. atol=3e-1
@test sdf[sdf.parameters .== :theta2, :mean][1] ≈ 1. atol=3e-1
@test sdf[sdf.parameters .== :theta3, :mean][1] ≈ 1.5 atol=3e-1

The first prior is for a, and what are the next two priors for?

Also if we're estimating the initial conditions, do we have priors for those?

@Vaibhavdixit02
Member Author

The first two are priors for the initial conditions, and the last one, truncated(Normal(1.5,0.01),1.0,2.0), is for a.

Should probably switch to named tuples here, thanks for the idea 🙂

@bbbales2

Alright I think I tracked down what's happening. I can reproduce it in Stan 2.19.1, which is about a year old at this point. I don't know why this just showed up now or whatever, but it could be a seed thing or something.

This seems to be a problem when estimating initial conditions without a positivity constraint on them (since these are populations, I think a positivity constraint makes sense). I attached a reference model without constraints at the end. I ran about 100 chains, and 30 failed. With the initial conditions constrained to be positive, they all run.

But I think Stan is also giving up too early on these chains. What's happening is that when each chain starts, it picks a random initialization. Because these are random, you can get some initializations that are bad enough that the ODE solve fails (there's a parameter for the maximum number of steps taken between output points -- if this is hit, the sampler gives up).

The ODE solver is throwing an exception that Stan interprets as "give up totally" rather than "try new guesses" (with indexing problems, for instance, we do want to give up instantly rather than printing tons of errors). We made an issue to figure out what to do here: stan-dev/math#1887 , but I think this should be a "try new guesses" situation.

Leaning on the random initialization for this would be kinda sketchy in general, though. It presumably works here in 2D because the space isn't that big.

Here is a model that blows up:

functions {
  real[] dz_dt(real t,       // time
               real[] z,     // system state {prey, predator}
               real[] theta, // parameters
               real[] x_r,   // unused data
               int[] x_i) {
    real u = z[1];
    real v = z[2];
    
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    
    real du_dt = (alpha - beta * v) * u;
    real dv_dt = (-gamma + delta * u) * v;
    
    return { du_dt, dv_dt };
  }
}

data {
  int<lower = 0> N;
  real t0;
  real ts[N];
  real y_init[2];
  real y[N, 2];
}

transformed data {
  real beta = 1.0;
  real gamma = 3.0;
  real delta = 1.0;
}

parameters {
  real alpha;
  real z_init[2];
  real<lower = 0> sigma[2];
}

transformed parameters {
  real theta[4] = { alpha, beta, gamma, delta };
  real z[N, 2]
  = integrate_ode_rk45(dz_dt, z_init, t0, ts, theta,
                       rep_array(0.0, 0), rep_array(0, 0),
                       1e-5, 1e-3, 500);
}

model {
  alpha ~ normal(1.5, 0.01);
  
  sigma ~ inv_gamma(3, 3);
  z_init ~ normal(1, 0.01);
  
  for (k in 1:2) {
    y_init[k] ~ normal(z_init[k], sigma[k]);
    y[ , k] ~ normal(z[, k], sigma[k]);
  }
}

And here is R code to generate data:

library(deSolve)
library(rstan)

a = 1.5
b = 1.0
c = 3.0
d = 1.0

f = function(t, y, parms) {
  list(c(a * y[1] - b * y[1] * y[2],
         -c * y[2] + d * y[1] * y[2]))
}

u0 = c(1.0, 1.0)

t = c(0.0, seq(1, 10, length = 10))
sig = 0.01

out = ode(u0, t, f)

N = length(t) - 1
y = out[, 2:3] + matrix(rnorm(length(t) * 2, mean = 0, sd = sig), ncol = 2)

plot(t, y[,1])
plot(t, y[,2])

data = list(N = N,
            t0 = t[1],
            ts = t[2:length(t)],
            y = y[2:length(t),],
            y_init = y[1,])

stan_rdump(names(data), "data.dat", env = list2env(data))

And then run this with your choice of cmdstan. You'll have to search around for a seed that breaks, but you'll know you have the problem when you see something like:

Unrecoverable error evaluating the log probability at the initial value.
Exception: Max number of iterations exceeded (500). (in 'lv3.stan', line 18, column 4 to column 28)
Exception: Max number of iterations exceeded (500). (in 'lv3.stan', line 18, column 4 to column 28)

@Vaibhavdixit02
Member Author

Ah, makes sense. Yes, you definitely don't want this to be a "give up totally" situation. That situation comes up frequently when dealing with ODE parameter estimation, in my experience. We do handle this in some manner for other backends, like here; having a way to detect a failed solve and use it to kick the sampler in a different direction can be helpful. Thank you for investigating it thoroughly!
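
For context, a minimal sketch of that pattern on the Julia side (a hypothetical helper, not the actual DiffEqBayes internals): a failed ODE solve is turned into a rejected proposal by returning -Inf for the log-likelihood instead of aborting the whole run.

using OrdinaryDiffEq, Distributions

# Hypothetical log-likelihood for a proposed parameter vector θ; assumes `prob` is the
# ODEProblem, `t` the observation times, `data` a 2×length(t) matrix, and σ the noise scale.
function loglik(θ, prob, t, data, σ)
    _prob = remake(prob, p=θ)
    sol = solve(_prob, Tsit5(); saveat=t)
    sol.retcode == :Success || return -Inf   # failed/divergent solve: reject, don't abort
    return sum(logpdf(Normal(sol[i][j], σ), data[j, i]) for i in 1:length(t), j in 1:2)
end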

@ChrisRackauckas
Member

Awesome, thanks for the follow up.

@goedman

goedman commented May 15, 2020

Super, Ben. I just ran your example and the problem points to the return statement in the function, similar to here.

@bbbales2

I guess also, could someone help me figure out how these benchmarks work over here: https://sciml.ai/2020/05/09/ModelDiscovery.html

Additionally, benchmarks on other ODE systems demonstrate a 5x and 3x performance advantage for Turing over Stan.

So I guess one of the models is this: https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesLotkaVolterra.html

And that is a four parameter model with the priors:

[Truncated(Normal(1.5,0.5),0.5,2.5),Truncated(Normal(1.2,0.5),0,2),Truncated(Normal(3.0,0.5),1,4),Truncated(Normal(1.0,0.5),0,2)]

Are those the priors for just the Stan benchmarks or also the Julia benchmarks?

Are initial conditions being estimated?

Is the 3x-5x a walltime difference including warmup? Or after warmup only? Or is it N_eff/s, or N_eff/posterior draw?

Could someone generate a .csv file with simulated data for this so I can build a model and test it in Stan to see if anything unusual is happening?

Also it'd be nice to have the output of a few chains from Turing as .csv files so I can compare them to the Stan model and make sure we're sampling the same thing.

@Vaibhavdixit02
Member Author

Are those the priors for just the Stan benchmarks or also the Julia benchmarks?

Both

Are initial conditions being estimated?

No

Is the 3x-5x a walltime difference including warmup

Yes. We use the defaults mentioned here for arguments not passed in explicitly.

I will try to share the two .csv files

@Vaibhavdixit02
Member Author

https://drive.google.com/drive/folders/1AiNFv--gM-DEbABTWV--zDGNsKRfmxzk?usp=sharing

@bbbales2 you will find 2 csv files in the drive folder: the simulated data in simdata.csv and the chains from Turing in turingchain.csv.

@bbbales2

Thanks! I'll see if I can reproduce this.

@goedman

goedman commented May 15, 2020

I did do a couple of simple tests using the 4-parameter case.

perf_results.pdf

I used these scripts

Not a proper benchmark, but at least indicative. One problem is that the method stan_inference() always recompiles the Stan program. Also, Julia's BenchmarkTools will force a recompile for each run.

@ChrisRackauckas
Member

Is there any way to avoid that?

@bbbales2

One problem is that the method stan_inference() always recompiles the Stan program.

@goedman Yeah that would be nice to avoid. Stan compiles aren't quick, though we usually don't include them in the benchmarks.

The priors look different though: https://github.com/StanJulia/Stan.jl/blob/master/Examples_DiffEqBayes/diffeqbayes_tests/four_parameters/perf_turing.jl#L16 vs. https://github.com/StanJulia/Stan.jl/blob/master/Examples_DiffEqBayes/diffeqbayes_tests/four_parameters/perf_stan.jl#L28

@Vaibhavdixit02 I was able to build a Stan model that appears to be sampling the same thing as the files you sent me.

I ran one chain (1000 warmup samples, 9000 post-warmup samples) and it took 53s. I guess my computer isn't the same computer as the one here, but that benchmark is reporting something like 175s, and I don't think my computer's cores would be 3x faster than something else. Do you know what kind of computer that benchmark is running on?

The performance in terms of ESS/posterior draw is similar. This can be kinda random chain to chain so nothing sets off alarms. So I don't think there's anything going on at a high level. With a dense metric Stan runs in 13s, though the Neff/posterior draw is lower. I don't know what adaptations/warmups Turing or DynamicHMC are using.

So for Stan:

  variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 theta[1] 1.54   1.53  0.0787 0.0763 1.41  1.67   1.00    2420.    2930.
2 theta[2] 1.11   1.10  0.0954 0.0865 0.972 1.28   1.00    3017.    2902.
3 theta[3] 3.06   3.05  0.239  0.233  2.69  3.47   1.00    2497.    2722.
4 theta[4] 0.942  0.939 0.0775 0.0761 0.820 1.07   1.00    2505.    3250.
5 σ[1]     0.469  0.456 0.0911 0.0816 0.348 0.636  1.00    3057.    3312.

For Turing.jl:

  variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 theta[1] 1.54   1.53  0.0838 0.0809 1.41  1.68   1.00    1831.    2348.
2 theta[2] 1.11   1.10  0.0986 0.0935 0.960 1.28   1.00    2488.    2820.
3 theta[3] 3.07   3.06  0.256  0.247  2.68  3.51   1.00    1889.    2128.
4 theta[4] 0.944  0.941 0.0826 0.0810 0.813 1.08   1.00    1860.    2057.
5 σ[1]     0.484  0.471 0.0952 0.0879 0.354 0.661  1.00    2864.    3249.

And then for them both mixed together (as if Stan is one chain and Turing is another, to verify they're sampling the same thing):

# A tibble: 5 x 10
  variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 theta[1] 1.54   1.53  0.0813 0.0789 1.41  1.68   1.00    4130.    4749.
2 theta[2] 1.11   1.10  0.0970 0.0897 0.966 1.28   1.00    5309.    5814.
3 theta[3] 3.07   3.06  0.247  0.240  2.68  3.49   1.00    4268.    5009.
4 theta[4] 0.943  0.940 0.0801 0.0785 0.816 1.08   1.00    4229.    5205.
5 σ[1]     0.477  0.463 0.0935 0.0852 0.350 0.651  1.00    5536.    6952.

Here is the Stan code:

functions {
  real[] dz_dt(real t,       // time
               real[] z,     // system state {prey, predator}
               real[] theta, // parameters
               real[] x_r,   // unused data
               int[] x_i) {
    real u = z[1];
    real v = z[2];
    
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    
    real du_dt = (alpha - beta * v) * u;
    real dv_dt = (-gamma + delta * u) * v;
    
    return { du_dt, dv_dt };
  }
}

data {
  int<lower = 0> N;
  real t0;
  real ts[N];
  real z_init[2];
  real y[N, 2];
}

parameters {
  real<lower = 0.5, upper = 2.5> alpha;
  real<lower = 0.0, upper = 2.0> beta;
  real<lower = 1.0, upper = 4.0> gamma;
  real<lower = 0.0, upper = 2.0> delta;
  real<lower = 0> sigma;
}

transformed parameters {
  real theta[4] = { alpha, beta, gamma, delta };
  real z[N, 2]
  = integrate_ode_rk45(dz_dt, z_init, t0, ts, theta,
                       rep_array(0.0, 0), rep_array(0, 0),
                       1e-5, 1e-3, 5e2);
}

model {
  alpha ~ normal(1.5, 0.5);
  beta ~ normal(1.2, 0.5);
  gamma ~ normal(3.0, 0.5);
  delta ~ normal(1.0, 0.5);
  
  sigma ~ inv_gamma(3, 3);

  for (k in 1:2) {
    y[, k] ~ normal(z[, k], sigma);
  }
}

Here is the R code to run and compare:

library(cmdstanr)
library(tidyverse)
library(posterior)

u0 = c(1.0, 1.0)

model = cmdstan_model("lv2.stan")

t = c(0.0, seq(1, 10, length = 10))  # assumed: same time points as the earlier data-generation script

N = length(t) - 1
y = read_csv("simdata.csv") %>%
  as.matrix

plot(t[2:length(t)], y[,1])
plot(t[2:length(t)], y[,2])

samples = read_csv("turingchain.csv")

fit = model$sample(data = list(N = N,
                               t0 = t[1],
                               ts = t[2:length(t)],
                               y = y,
                               z_init = u0),
                   iter_sampling = nrow(samples),
                   chains = 1)

stan_df = fit$draws() %>%
  as_draws_df() %>%
  select(.iteration, .chain, "theta[1]", "theta[2]", "theta[3]", "theta[4]", "sigma") %>%
  rename("σ[1]" = "sigma")

turing_df = samples %>%
  mutate(.chain = 2) %>%
  as_draws_df()

stan_df %>%
  summarise_draws()

turing_df %>%
  summarise_draws()

bind_draws(stan_df, turing_df, along = "chain") %>%
  summarise_draws()

@ChrisRackauckas
Member

The computer is mentioned in the benchmark:

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

@bbbales2

Alright I ran the Stan model on my laptop and it took 73s (i5-6300U CPU @ 2.40GHz).

I wonder if the tolerances were really small or something?

@Vaibhavdixit02
Member Author

The tolerances used were the defaults on our end, reltol=1e-3, abstol=1e-6

@goedman

goedman commented May 17, 2020

@bbbales2 , I'll run the benchmarks with your data.

The priors in my performance summary were indeed not equal, but the timing hardly changes when I use identical priors (and a noise term of 0.5rand(2) instead of 0.01rand(2)).

@ChrisRackauckas , I pass in an existing Stanmodel or nothing (default) after the priors argument:

function stan_inference(prob::DiffEqBase.DEProblem, t, data, 
  priors=nothing, 
  stanmodel=nothing;
  alg=:rk45, num_samples=1000, num_warmup=1000, reltol=1e-3,
  abstol=1e-6, maxiter=Int(1e5),likelihood=Normal,
  vars=(StanODEData(),InverseGamma(3,3)),nchains=1,
  sample_u0 = false, save_idxs = nothing, diffeq_string = nothing,
  printsummary = true)

If compiled previously I use bayesian_result.model.

and added:

  if isnothing(stanmodel)
    stanmodel = StanSample.SampleModel(
      "pem", parameter_estimation_model, [nchains],
      tmpdir=tmpdir,
      method=StanSample.Sample(num_samples=num_samples, num_warmup=num_warmup)) 
  end

just before the data Dict.
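
A usage sketch of that change (assuming the modified signature above, with the compiled model as the fifth positional argument): the first call compiles the Stan program, and later calls reuse the compiled model and skip the C++ compile step.

# First call: compiles the Stan program and samples.
res1 = stan_inference(prob, t, data, priors)

# Later calls: pass the already-compiled model back in.
res2 = stan_inference(prob, t, data, priors, res1.model; num_samples=2000)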

@ChrisRackauckas
Member

Thanks @goedman . I added a PR for that here: #164

@Vaibhavdixit02 we should get our benchmarks updated to use this.

@goedman

goedman commented May 17, 2020

@bbbales2

Using:

randomized = VectorOfArray([(sol(t[i]) + .5randn(2)) for i in 1:length(t)])
data = convert(Array,randomized)

priors = [truncated(Normal(1.5,0.5),0.5,2.5),truncated(Normal(1.2,0.5),0,2.0),
          truncated(Normal(1.0,0.5),1.0,4),truncated(Normal(1.0,0.5),0,2)]

and running both perf_turing.jl and perf_stan.jl, I get timing numbers very similar to the ones I reported above for both Turing and Stan.

With 9000 post-warmup samples I get 43 sec for Stan.

With num_samples=2000 and nchains=4 I get 14 secs for Stan.jl (with julia14 -p auto).

This is in line with what we have found with other benchmarks. Performance for small models (few parameters, not too many observations) has gotten much more comparable between Stan, DynamicHMC, and Turing/AdvancedHMC. This is amazing progress from 2 years ago!

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER = geo.pkg.julialang.org
  JULIA_EDITOR = subl
  JULIA_CMDSTAN_HOME = /Users/rob/Projects/StanSupport/cmdstan
  JULIA_SVG_BROWSER = Google Chrome.app
  JULIA_SPECIALFUNCTIONS_BUILD_SOURCE = true
  JULIA_PKG3_PRECOMPILE = true

All runs with cmdstan 2.23.0.

@ChrisRackauckas
Member

Alright, we'll get our posted benchmarks updated. SciML/SciMLBenchmarks.jl#103 . Indeed, Turing has come a really long way: 2 years ago it failed all of our tests and was quite slow due to AD issues, but it's cleaned all of that up.

@Vaibhavdixit02
Member Author

@goedman I tried the timings after #164.

I see the speedup from using a compiled model quite clearly on the one-parameter model, but there is not a lot of gain for the four-parameter case.

I am getting timings around

Warmup took (20) seconds, 20 seconds total
Sampling took (174) seconds, 2.9 minutes total

for the four-parameter case with 1000 warmup and 9000 post-warmup samples.

Can you run stan_inference for the four-parameter case and confirm the timing?

@goedman

goedman commented May 18, 2020

Will do. A bit busy right now, will have a bit more time over the next 2 or 3 days.

But with the program below, @btime for 10000 post-warmup samples returns:

  50.731 s (1361490 allocations: 56.93 MiB)

BenchmarkTools is not really suitable for Stan, but subtracting 10 seconds from the above number is typically what I see without compilation. The final @time run confirms that:

 40.519250 seconds (1.36 M allocations: 57.117 MiB, 0.09% gc time)

I roughly see:

File /Users/rob/tmp/parameter_estimation_model.stan will be updated.

Inference for Stan model: parameter_estimation_model_model
1 chains: each with iter=(10000); warmup=(0); thin=(1); 10000 iterations saved.

Warmup took (3.7) seconds, 3.7 seconds total
Sampling took (33) seconds, 33 seconds total

Is it hard for you to try it on cmdstan v2.23.0? But I can't believe this explains the factor of 4 between your result and what both Ben and I found. It baffles me.

You're right: if sampling takes ~3 minutes, the 10 seconds or so it takes to compile the model is irrelevant. That was Chris's (valid) argument way back for not bothering with it.

Also, the 20 secs for 1000 warmup samples vs. ~180 secs for 9000 samples seems reasonable.
Or, as in the run above, 3.7 secs of warmup and 33 secs of post-warmup sampling.

Finally, I would never run it this way; I would just take 2250 samples in each of 4 chains and append the chains (see the sketch below).
The two issues I would like to dig into a bit more are 1) why in this case we ask for so many post-warmup samples (9000 or 10000?) and 2) whether we shouldn't simply use an init Dict (like the data Dict) for the parameters in this case. I expect that would reduce the +/- 6 sec variance between run timings.
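
For reference, a sketch of that multi-chain alternative, using the num_samples and nchains keywords from the stan_inference signature quoted earlier (values illustrative):

bayesian_result_stan = stan_inference(prob, t, data, priors;
  num_samples=2250, nchains=4, printsummary=true)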

Anyway, this is the script I ran (just to be sure :-):

using DiffEqBayes, CmdStan
using Distributions, BenchmarkTools
using OrdinaryDiffEq, RecursiveArrayTools, ParameterizedFunctions
using Plots
gr(fmt=:png)

#### Initializing the problem

f = @ode_def LotkaVolterraTest begin
    dx = a*x - b*x*y
    dy = -c*y + d*x*y
end a b c d

u0 = [1.0,1.0]
tspan = (0.0,10.0)
p = [1.5,1.0,3.0,1.0]

prob = ODEProblem(f,u0,tspan,p)
sol = solve(prob,Tsit5())

#### Take the solution data and add noise to it to obtain the data used in the Bayesian inference of the parameters

t = collect(range(1,stop=10,length=10))
sig = 0.49
data = convert(Array, VectorOfArray([(sol(t[i]) + sig*randn(2)) for i in 1:length(t)]))

#### Plots of the actual data and generated data

scatter(t, data[1,:], lab="#prey (data)")
scatter!(t, data[2,:], lab="#predator (data)")
plot!(sol)

priors = [
  truncated(Normal(1.5,0.5),0.5,2.5),
  truncated(Normal(1.2,0.5),0,2),
  truncated(Normal(3.0,0.5),1,4),
  truncated(Normal(1.0,0.5),0,2)
]

@btime bayesian_result_stan = stan_inference(prob,t,data,priors,
  num_samples=10_000,printsummary=true)

@time bayesian_result_stan = stan_inference(prob,t,data,priors,
  num_samples=10_000,printsummary=true)

@goedman

goedman commented May 18, 2020

And if I run the Turing test I get about 20 secs, twice as fast as Stan for this example!

@ChrisRackauckas
Member

Indeed, since the functions are stochastic, BenchmarkTools doesn't make sense in general. For our benchmarks, I am thinking of the following:

  • 3 runs of each after compilation (not that compilation will matter much here, but it's just good practice).
  • Turing with Tsit5, CVODE_Adams, and dopri5, to isolate the ODE solver differences (that probably accounts for the 2x).
  • Show each of the chain results; 3 is a decent number to eyeball.
  • Take the average of the times, but 3 gives enough to eyeball variability.
