Some updates #154
Conversation
@ChrisRackauckas I can't recreate the error on my end, can you try it out?

I can't figure out how to get Stan installed on Windows 🤷

Maybe @goedman could help?

StanJulia/CmdStan.jl#92 might be more than helpful to get this solved.
We might as well merge and tag, and pick this up from there. |
It seems the cmdstan binary is working ok because the first summary is created (one parameter case).
The 2nd compilation fails with a `stanc` error (`Return code = -5` message), hence a stan language error.
I’m trying to re-create the problem on my machine (MacOS):
```
julia> include("/Users/rob/.julia/dev/CmdStan/test/DiffEqBayes/deb_stan.jl");
[ Info: Precompiling CmdStan [593b3428-ca2f-500c-ae53-031589ec8ddd]
[ Info: Precompiling DiffEqBayes [ebbdde9d-f333-5424-9be2-dbf1e9acfb5e]
[ Info: Precompiling OrdinaryDiffEq [1dea7af3-3e70-54e6-95c3-0bf5283fa5ed]
[ Info: Precompiling ParameterizedFunctions [65888b18-ceab-5e60-b2b9-181511a3b968]
One parameter case
ERROR: LoadError: type ODESystem has no field states
```
Do I need a master version?
On my system it runs most of the time. In 15 runs I saw two kinds of problems. Twice:
and several times:

The second problem always appears after many informational messages.

I get the informational messages, but the test always passes for me.

We can make the priors have smaller ranges for Stan to help it pass. I still don't know what the return code -5 issue is on CI though.
Below is the output of the run (sampling) log when the return code = -5. It looks like it occurs in the call to
line 22 is the return line in sho():

Oh, it looks like the Stan ODE solver is diverging. It makes sense that's where the issue is, thanks. Yeah, we can just give it more restrictive priors (#155) and it should be fine.
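The "more restrictive priors" fix (#155) amounts to truncating the Normal priors. As a language-neutral illustration of why that helps, here is a hedged Python sketch; the bounds and the `truncated_normal` helper are hypothetical, not the exact values from the PR. The point is that truncated draws can never propose, say, a negative growth rate that makes the ODE solve blow up.

```python
import numpy as np

def truncated_normal(mu, sd, lo, hi, size, rng):
    """Draw from Normal(mu, sd) truncated to [lo, hi] by rejection sampling."""
    out = np.empty(0)
    while out.size < size:
        draws = rng.normal(mu, sd, size)
        out = np.concatenate([out, draws[(draws >= lo) & (draws <= hi)]])
    return out[:size]

rng = np.random.default_rng(0)
# Hypothetical prior for an LV growth-rate parameter (true value 1.5);
# these bounds are illustrative, not the ones used in PR #155.
mu, sd, lo, hi = 1.5, 0.5, 0.5, 2.5
trunc = truncated_normal(mu, sd, lo, hi, 10_000, rng)
plain = rng.normal(mu, sd, 10_000)

# Every truncated draw respects the bounds; the unrestricted Normal
# regularly proposes values below `lo`, where the ODE solve struggles.
print(trunc.min() >= lo and trunc.max() <= hi)  # True
```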
Just looking at the very first simulation (the others show similar behavior), the posterior for the model is pretty much identical to the prior used, e.g. theta1; the 5%, 50% and 95% regions are also worrisome:
Is that what you expect? This run used a ![diffeqbayes_1_stan](https://user-images.githubusercontent.com/52200/80136644-03837680-8570-11ea-95e7-ef34368645b2.png) Anyway, there are probably good reasons to do it this way for this model.

@goedman the only reason to have the priors be so restrictive is to get the tests to pass 🙂 Do you have any idea if we are doing something wrong on our end, or is there something going on at CmdStan's end? I also had some issues with robustness of the Stan backend while preparing the benchmarks https://github.com/SciML/DiffEqBenchmarks.jl/blob/master/markdown/ParameterEstimation/DiffEqBayesLotkaVolterra.md#stanjl-backend, which was a little surprising to me because we have used the same problem in the past and Stan used to perform considerably better than the others.

And also LV is a pretty standard problem; Stan has its own example for it: https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html
Overall: Using more data points, e.g. 30 vs. 10, seems to help the conditioning for the LV benchmark example.

Details: Playing around with the 4-parameter LV example from the benchmarks I get much better results:
The dcmc result arrays are the means and stds of the 4 thetas and 2 sigmas. Theta means are identical between stan and dcmc. Sigma means are slightly offset (see below figs). Problems with dcmc with 10 data points disappeared after changing it to 30 (this also made stan behave better). I ran stan in 4 chains of 800 samples and dcmc as a single run of 3200 samples. On my local machine I updated The script on Julia 1.4.1 with cmdstan v2.22.1 (assumes the slightly updated
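The "more data points help the conditioning" observation above can be illustrated with a toy Python sketch (a straight-line fit, not the LV model; all names here are mine): the spread of a parameter estimate shrinks roughly as 1/sqrt(n), which is the same effect that made 30 observation points better behaved than 10.

```python
import numpy as np

rng = np.random.default_rng(1)
true_slope = 1.5

def estimate_spread(n, reps=2000):
    """Std of the slope estimate across `reps` noisy datasets of size n."""
    t = np.linspace(0.0, 10.0, n)
    estimates = []
    for _ in range(reps):
        y = true_slope * t + rng.normal(0.0, 1.0, n)
        estimates.append(np.polyfit(t, y, 1)[0])  # fitted slope
    return np.std(estimates)

spread_10 = estimate_spread(10)
spread_30 = estimate_spread(30)
# The 30-point fits cluster noticeably tighter around the true slope.
print(spread_10, spread_30)
```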
Interesting. Is there anything to upstream here? |
The ability to pass in a previously compiled Stanmodel is a very minor change. In my version I also shortened the model name from

As suggested by @Vaibhavdixit02, I also went back to the original LV example in Stan and used both DiffEqBayes and cmdstan to get a feeling for how the results compare. Turns out the theta-s are very comparable, e.g.:
The sigma results vary quite a bit. The 89% bounds of the coefficient samples passed to the integration are pretty wide in both cases: and (the DiffEqBayes Stan language formulation): The lynx-hare dataset deviates quite a bit from the DiffEqBayes solution (see the first 2 plots), but all results for the theta coefficients are still pretty close. For now I have stored all 7 runs in subdirectories of CmdStan.jl, and the cmdstan versions use the StatisticalRethinking approach.
We just can't get this to pass on CI. Are we using the wrong version of CmdStan?

Hi Chris, StatisticalRethinking is not using CmdStan.jl but StanSample.jl (a refactoring of CmdStan.jl). What would you like to pass on CI? The above example, or the ability to pass in a previously compiled Stan model? Or both? If so, it might be easier if I make a PR matching the DiffEqBayes setup.

I am just hoping we can keep a working function. We can't seem to recreate those results above: #155. Stan seems to diverge on every run. If you have any ideas for how to fix it, that would be great. The example from above, Lotka-Volterra, is what we use as our integration test, and Stan is the only one that seems to consistently have issues with it. It doesn't seem to be stochastic either (it fails on 100% of runs), so I am at a bit of a loss as to what to do.
Thanks, will have a go.

I'm not sure what the problem is at turing_inference.jl line 47. As is, DiffEqBayes won't compile because @logpdf is undefined; to compile DiffEqBayes I have to disable that line. The one-parameter cases without truncated(Normal(...)) fail regularly, usually pointing to line 22 in the Stan language model file, sho()'s return. If I disable those cases, I cannot confirm your findings. Yes, occasionally Stan can't find a good starting value in one of the chains (as do Turing & DynamicHMC & the original Stan example). For now I have a forked DiffEqBayes in StanJulia/DiffEqBayes.jl with my results, and also an extra test script (stan_local_test.jl) that shows the chains. It only tests Stan; I disabled the other mcmc methods.

Turing.jl had a breaking change.

I see, we can make that truncated if it helps.

That's the next item I wanted to look into. The four-parameter summary now looks ok, but the one-parameter results are weird. I would like to understand why 5% == 50% == 95% are all 1.5.
If something stopped working we definitely wanna figure that out! I'm trying to reproduce this in R and Stan so we can figure out what happened. I'm looking at the diff here and trying to use the previous model code: https://github.com/SciML/DiffEqBayes.jl/pull/155/files

Are we estimating the initial conditions? If not, are we using the true initial conditions for the ODE, or are we just guessing they're equal to the first measured value? Do I have the right priors for the differential equation parameters in the model (code at the end)? They're centered around the true values with standard deviation 0.01?

And here is the R code to generate data and run the model:

```
library(deSolve)
library(cmdstanr)

a = 1.5
b = 1.0
c = 3.0
d = 1.0
f = function(t, y, parms) {
  list(c(a * y[1] - b * y[1] * y[2],
         -c * y[2] + d * y[1] * y[2]))
}
u0 = c(1.0, 1.0)
t = c(0.0, seq(1, 10, length = 10))
sig = 0.01
out = ode(u0, t, f)
model = cmdstan_model("lv2.stan")
N = length(t) - 1
y = out[, 2:3] + matrix(rnorm(length(t) * 2, mean = 0, sd = sig), ncol = 2)
plot(t, y[, 1])
plot(t, y[, 2])
fit = model$sample(data = list(N = N,
                               t0 = t[1],
                               ts = t[2:length(t)],
                               y = y[2:length(t), ],
                               y_init = y[1, ]),
                   cores = 4)
fit$summary()
```

Stan code:

```
functions {
  real[] dz_dt(real t,       // time
               real[] z,     // system state {prey, predator}
               real[] theta, // parameters
               real[] x_r,   // unused data
               int[] x_i) {
    real u = z[1];
    real v = z[2];
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    real du_dt = (alpha - beta * v) * u;
    real dv_dt = (-gamma + delta * u) * v;
    return { du_dt, dv_dt };
  }
}
data {
  int<lower = 0> N;
  real t0;
  real ts[N];
  real y_init[2];
  real y[N, 2];
}
parameters {
  real<lower = 0.0, upper = 2.0> alpha;
  real<lower = 0.0, upper = 1.5> beta;
  real<lower = 0.0, upper = 4.0> gamma;
  real<lower = 0.0, upper = 2.0> delta;
  //real z_init[2];
  real<lower = 0> sigma[2];
}
transformed parameters {
  real theta[4] = { alpha, beta, gamma, delta };
  real z[N, 2]
    = integrate_ode_rk45(dz_dt, y_init, t0, ts, theta,
                         rep_array(0.0, 0), rep_array(0, 0),
                         1e-5, 1e-3, 5e2);
}
model {
  alpha ~ normal(1.5, 0.01);
  beta ~ normal(1.0, 0.01);
  gamma ~ normal(3.0, 0.01);
  delta ~ normal(1.0, 0.01);
  sigma ~ inv_gamma(4, 1);
  //z_init ~ normal(0, 10.0);
  for (k in 1:2) {
    //y_init[k] ~ normal(z_init[k], sigma[k]);
    y[ , k] ~ normal(z[, k], sigma[k]);
  }
}
```

Right now things seem to run:

```
> fit$summary()
# A tibble: 31 x 10
  variable   mean    median  sd      mad     q5      q95    rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
1 lp__     38.9    39.2     1.83    1.64    35.4    41.2     1.00   1784.    2201.
2 alpha     1.50    1.50    0.00378 0.00378  1.50    1.51    1.00   2133.    2500.
3 beta      0.998   0.998   0.00753 0.00741  0.986   1.01    1.00   2835.    2583.
4 gamma     3.00    3.00    0.00909 0.00910  2.98    3.01    1.00   2870.    2687.
5 delta     1.00    1.00    0.00538 0.00525  0.993   1.01    1.00   2327.    2136.
6 sigma[1]  0.0907  0.0862  0.0262  0.0238   0.0570  0.141   1.00   2965.    2678.
7 sigma[2]  0.0846  0.0813  0.0233  0.0214   0.0534  0.129   1.00   3397.    2564.
```
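For readers without R handy, the data-generation step above can be sketched in Python. This is a stand-in that uses a hand-rolled fixed-step RK4 integrator instead of deSolve::ode; the `rk4` helper is mine, not from the thread, and the noise level matches the R script's sd = 0.01.

```python
import numpy as np

# Lotka-Volterra parameters from the R script above.
a, b, c, d = 1.5, 1.0, 3.0, 1.0

def f(t, y):
    u, v = y
    return np.array([a * u - b * u * v, -c * v + d * u * v])

def rk4(f, y0, ts, h=1e-3):
    """Integrate dy/dt = f(t, y), returning the state at each time in ts."""
    t = ts[0]
    y = np.array(y0, dtype=float)
    out = [y.copy()]
    for t_next in ts[1:]:
        while t < t_next - 1e-12:
            step = min(h, t_next - t)
            k1 = f(t, y)
            k2 = f(t + step / 2, y + step / 2 * k1)
            k3 = f(t + step / 2, y + step / 2 * k2)
            k4 = f(t + step, y + step * k3)
            y = y + step / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            t += step
        out.append(y.copy())
    return np.array(out)

# t = c(0.0, seq(1, 10, length = 10)) in the R code.
ts = np.concatenate([[0.0], np.linspace(1.0, 10.0, 10)])
sol = rk4(f, [1.0, 1.0], ts)
rng = np.random.default_rng(0)
y_obs = sol + rng.normal(0.0, 0.01, sol.shape)  # noisy observations
```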
Hi Ben, thanks for stepping in. The problem on CI has been resolved by making sure the earlier part of the tests (the one-parameter cases) uses a truncated(Normal()) prior. You cover that in the parameters section. The part I don't get is why it seemed to have worked in earlier versions with just Normal() priors. Running the

I did a quick test with Stan's optimize to find initial values, but that failed. I have in general not been very successful with Stan's optimize.
Oh okay, so one problem is with the one-parameter model with non-truncated priors? What I see in red for the diff here is

That commit also has changes to a bunch of other tests, which makes me think that it wasn't just the one-parameter model causing problems. Any single model that used to work but doesn't now is a start to hunting this down. It's entirely conceivable that a model sensitive to initialization could start working by changing the parameter constraints (because of how random initialization works on the unconstrained space). ODEs can definitely be difficult in this way, especially if we're estimating initial conditions (I guess I've now talked about initial conditions in two settings, so that's confusing), but I don't think we've changed anything about how we do initialization recently (so back to the same head scratching). Is there a particular commit of the test file I should look at to get the priors? I can have a look, try to write up the Stan model, and make sure we're testing the same thing.
Hi @bbbales2, let me clarify things for you.

In the first test we are not estimating the initial conditions. I checked the commits, and it looks like we had switched this to a truncated Normal quite a while back, but that was to increase the test coverage, so I am 90% sure it used to work without the truncated distribution. In the second test we are estimating the initial conditions as well, and this test failed without the truncation added in this commit.

Yes.

We are using

I hope this clarifies the situation a bit?
Adding my $0.02: The one-parameter test consists of 4 cases. When I started to look at the tests, the first case used a truncated prior, the next 3 just Normal() priors. Invariably, one of the 3 Normal() tests failed. In my tests I also updated the lower truncation bounds to 0.1 (> 0, what they used to be). Chris, in the above commit, has gone even a step further, setting some of them to 0.5 or 1. I think since Chris applied the above update, CI has been fine. Reverting line 26, 38 or 46 should immediately show the problem. But I have not been able to pass CI with all of 26, 38 and 46 reverted, and I am at a loss how to show that at some point, say with DiffEqBayes v2.12.0, this once worked. DiffEqBayes still uses cmdstan v2.21.0 and I test against v2.23.0, but that's not the issue here. Like you point out, Ben, with all formulations I have tested, including Bob C's original Lotka-Volterra example with real lynx-hare data, very occasionally I do see 1 chain diverge.
In the model here: Lines 26 to 33 in e4af121
The first prior is for

Also, if we're estimating the initial conditions, do we have priors for those?

The first two are priors for the initial conditions and the last one

Should probably switch to named tuples here, thanks for the idea 🙂
Alright, I think I tracked down what's happening. I can reproduce it in Stan 2.19.1, which is about a year old at this point. I don't know why this just showed up now; it could be a seed thing or something. This seems to be a problem when estimating initial conditions without a positive constraint on them (since these are populations, I think a positivity constraint makes sense). I attached a reference model at the end without constraints. I ran about 100 chains, and 30 failed. With the initial conditions constrained to be positive, they all run. But I think Stan is giving up early on these chains too. What's happening is that when each chain starts, it picks a random initialization. Because these are random, you can get some initializations that are bad enough that the ODE solve fails (there's a parameter for the maximum number of steps taken between output points; if this is hit, the sampler gives up). The ODE solver is throwing an exception that Stan interprets as "give up totally" rather than "try new guesses" (with indexing problems, for instance, we do want to give up instantly rather than printing tons of errors). We made an issue to figure out what to do here: stan-dev/math#1887, but I think this should be a "try new guesses" situation. Leaning on the random initialization for this would be kinda sketchy in general though; it presumably works here in 2D because the space isn't that big. Here is a model that blows up:
And here is R code to generate data:
And then run this with your choice of cmdstan. You'll have to search around for a seed that breaks, but you'll know you have the problem when you see something like:
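The failure mode described above (a random initialization makes the ODE solve exceed its step budget and the whole chain aborts) versus the proposed "try new guesses" behavior can be sketched in Python. Everything here is a toy: the `toy_solve` cost model and its step budget are entirely hypothetical, not Stan's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

class OdeMaxStepsError(RuntimeError):
    """Raised when the solver would exceed its step budget between outputs."""

def toy_solve(theta, max_steps=300):
    # Stand-in for an ODE solve: large |theta| makes the problem "stiff"
    # enough that the step budget is exhausted and the solve fails.
    needed = int(100 * np.exp(abs(theta)))
    if needed > max_steps:
        raise OdeMaxStepsError(f"needed {needed} steps > {max_steps}")
    return needed

def initialize(max_tries=100):
    # Random initial values on [-2, 2] (mirroring Stan's draws on the
    # unconstrained space); on solver failure, redraw instead of aborting.
    for _ in range(max_tries):
        theta0 = rng.uniform(-2.0, 2.0)
        try:
            toy_solve(theta0)
            return theta0
        except OdeMaxStepsError:
            continue  # "try new guesses" instead of "give up totally"
    raise RuntimeError("no valid initial value found")

theta0 = initialize()
print(abs(theta0) < np.log(3.01))  # True: only feasible inits survive
```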
Ah, makes sense. Yes, you definitely don't want this to be a "give up totally" situation. That situation comes up frequently when dealing with ODE parameter estimation, in my experience; we do handle this in some manner for other backends, like here, and having a way to detect this and use it to kick the sampler in a different direction can be helpful. Thank you for investigating it thoroughly!
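One common way to "kick the sampler in a different direction" on a failed solve is to map the failure to log-density -inf, so a Metropolis-style step simply rejects the proposal instead of crashing the chain. A hedged Python sketch (illustrative only, not the DiffEqBayes implementation; the divergence condition is faked):

```python
import numpy as np

def log_likelihood(theta):
    if abs(theta) > 2.0:  # pretend the ODE solve diverges out here
        return -np.inf    # failed solve => reject, don't crash
    return -0.5 * (theta - 1.5) ** 2  # toy Gaussian log-likelihood

def metropolis(n_samples, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, lp = 1.0, log_likelihood(1.0)
    chain = []
    for _ in range(n_samples):
        prop = theta + rng.normal(0.0, step)
        lp_prop = log_likelihood(prop)  # -inf => acceptance probability 0
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain)

chain = metropolis(5000)
# The chain never settles on a "diverging" parameter value.
print(np.all(np.abs(chain) <= 2.0))  # True
```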
Awesome, thanks for the follow up.

Super, Ben, I just ran your example and the problem points to the return statement in the function, similar to here
Could someone also help me figure out how these benchmarks work over here: https://sciml.ai/2020/05/09/ModelDiscovery.html

So I guess one of the models is this: https://benchmarks.sciml.ai/html/ParameterEstimation/DiffEqBayesLotkaVolterra.html And that is a four-parameter model with the priors:

Are those the priors for just the Stan benchmarks or also the Julia benchmarks? Are initial conditions being estimated? Is the 3x-5x a walltime difference including warmup? Or after warmup only? Or is it N_eff/s, or N_eff/posterior draw? Could someone generate a .csv file with simulated data for this so I can build a model and test it in Stan to see if anything unusual is happening? Also it'd be nice to have the output of a few chains from Turing as .csv files so I can compare them to the Stan model and make sure we're sampling the same thing.
Both
No
Yes. We use the defaults mentioned here for arguments not passed in explicitly. I will try to share the two .csv files
https://drive.google.com/drive/folders/1AiNFv--gM-DEbABTWV--zDGNsKRfmxzk?usp=sharing @bbbales2 you will find 2 csv files in the drive folder: the simulated data in
Thanks! I'll see if I can reproduce this.
I did a couple of simple tests using the 4-parameter case. I used these scripts Not a proper benchmark, but at least indicative. One problem is that the method

Is there any way to avoid that?
@goedman Yeah, that would be nice to avoid. Stan compiles aren't quick, though we usually don't include them in the benchmarks. The priors look different though: https://github.com/StanJulia/Stan.jl/blob/master/Examples_DiffEqBayes/diffeqbayes_tests/four_parameters/perf_turing.jl#L16 vs. https://github.com/StanJulia/Stan.jl/blob/master/Examples_DiffEqBayes/diffeqbayes_tests/four_parameters/perf_stan.jl#L28

@Vaibhavdixit02 I was able to build a Stan model that appears to be sampling the same thing as the files you sent me. I ran one chain (1000 warmup samples, 9000 post-warmup samples) and it took 53s. I guess my computer isn't the same computer as the one here, but that benchmark is reporting something like 175s, and I don't think my computer's cores would be 3x faster than something else. Do you know what kind of computer that benchmark is running on? The performance in terms of ESS/posterior draw is similar. This can be kinda random chain to chain, so nothing sets off alarms. So I don't think there's anything going on at a high level. With a dense metric Stan runs in 13s, though the Neff/posterior draw is lower. I don't know what adaptations/warmups Turing or DynamicHMC are using. So for Stan:
For Turing.jl:
And then for them both mixed together (as if Stan is one chain and Turing is another, to verify they're sampling the same thing):
Here is the Stan code:
Here is the R code to run and compare:
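Since the comparison above is in ESS per posterior draw, here is a minimal Python sketch of how effective sample size can be estimated from a chain's autocorrelation. This is a simplified version of what Stan and similar tools report; real implementations use more careful truncation rules than "stop at the first negative autocorrelation" used here.

```python
import numpy as np

def ess(x):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] < 0.0:  # truncate at the first negative autocorrelation
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(0)
iid = rng.normal(size=4000)       # independent draws: ESS near n
ar = np.empty(4000)               # strongly correlated AR(1) chain
ar[0] = 0.0
for i in range(1, 4000):
    ar[i] = 0.9 * ar[i - 1] + rng.normal()
print(ess(iid) > ess(ar))  # True: correlated draws carry less information
```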
The computer is mentioned in the benchmark:
Alright, I ran the Stan model on my laptop and it took 73s (i5-6300U CPU @ 2.40GHz). I wonder if the tolerances were really small or something?
The tolerances used were the defaults on our end,
@bbbales2, I'll run the benchmarks with your data. The priors in my performance summary were indeed not equal, but the timing hardly changes when I use identical priors (and a noise term of 0.5rand(2) instead of 0.01rand(2)). @ChrisRackauckas, I pass in an existing Stanmodel or nothing (default) after the priors argument:

If compiled previously I use and added:
just before the data Dict.
Thanks @goedman. I added a PR for that here: #164 @Vaibhavdixit02 we should get our benchmarks updated to use this.
Using:
and running both perf_turing.jl and perf_stan.jl I get very similar timing numbers to the ones I reported above for both Turing and Stan. With 9000 post-warmup samples I get 43 sec for Stan. With num_samples=2000 and nchains=4 I get 14 secs for Stan.jl (with

This is in line with what we have found with other benchmarks. Performance for small models (few parameters / not too many observations) has gotten much more comparable between Stan, DynamicHMC and Turing/AMHC. This is amazing progress from 2 years ago!
All runs used cmdstan 2.23.0.
Alright, we'll get our posted benchmarks updated: SciML/SciMLBenchmarks.jl#103. Indeed, Turing has come a really long way: 2 years ago it failed all of our tests and was quite slow due to AD issues, but it's cleaned all of that up.
@goedman I tried the timings after #164. I clearly see the speedup from using a compiled model on the one-parameter model, but there is not a lot of gain for the 4-parameter case. I am getting timings around

for the four-parameter case for 1000 warmup and 9000 post-warmup samples. Can you run the
Will do. A bit busy right now; I will have a bit more time over the next 2 or 3 days. But with the below program, @btime for 10000 post-warmup samples returns:

BenchmarkTools is not suitable for Stan, but subtracting 10 seconds from the above number is typically what I see without compilation. The final @time run confirms that:

I roughly see:

Is it hard for you to try it on cmdstan v2.23.0? But I can't believe this explains the factor of 4 between your result and what both Ben and I found. It baffles me. You're right, if sampling takes ~3 mins, the 10 secs or so it takes to compile the model is irrelevant. That was way back Chris' (valid) argument not to bother with it. Also the 20 secs for 1000 warmup samples vs. ~180 secs for 9000 samples seems reasonable. Finally, I would never run it this way; I'd just take 2250 samples in 4 chains and append the chains. Anyway, this is the script I ran (just to be sure :-):
And if I run the Turing test I get about 20 secs, twice as fast as Stan for this example!!
Indeed, since the functions are stochastic, BenchmarkTools doesn't make sense in general. For our benchmarks, I am thinking the following:
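One simple alternative for timing stochastic runs, sketched here in Python with a stand-in workload (the `noisy_workload` function is hypothetical, not the actual sampler): time several complete runs and report the median and spread, rather than a BenchmarkTools-style minimum that rewards the luckiest run.

```python
import statistics
import time

def noisy_workload(_state=[0]):
    # Stand-in for one sampling run whose cost varies from run to run
    # (a tiny LCG drives the variability so the demo is deterministic).
    _state[0] = (1103515245 * _state[0] + 12345) % (1 << 31)
    n = 50_000 + _state[0] % 50_000
    return sum(i * i for i in range(n))

times = []
for _ in range(7):
    t0 = time.perf_counter()
    noisy_workload()
    times.append(time.perf_counter() - t0)

# Median is robust to one unusually fast or slow run; report spread too.
print(f"median {statistics.median(times):.4f}s, "
      f"min {min(times):.4f}s, max {max(times):.4f}s")
```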