
tidy.mipo and tidy.glance #240

Merged · 5 commits · Jun 4, 2020

Conversation

@vincentarelbundock (Contributor)

This PR does:

  1. Add tests to pool.r.squared to make sure we get the same results before and after I modify it.
  2. Add a glanced object to mipo. This allows us to retrieve information about the original models stored in mira.
  3. Add support for mipo objects in pool.r.squared. This is step 3 because we need access to the new glanced object.
  4. Add new glance.mipo and tidy.mipo methods.
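
For context, a minimal usage sketch of the workflow these methods enable (column names are illustrative, not verbatim output):

library(mice)

imp <- mice(nhanes, m = 5, seed = 1, print = FALSE)  # impute (mids)
fit <- with(imp, lm(chl ~ bmi + age))                # fit per imputation (mira)
pooled <- pool(fit)                                  # pool estimates (mipo)

tidy(pooled)    # one row per term: estimate, std.error, p.value, ...
glance(pooled)  # one-row model summary, drawing on the new glanced component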

I have two technical questions for @stefvanbuuren:

  1. In pool.r.squared, did you intend to hardcode the value "3", or should it be df.residuals? https://github.com/vincentarelbundock/mice/blob/tidymipo/R/pool.r.squared.R#L89
  2. For mipo objects, I don't have a good way to detect if the original model was lm. Do you think that checking if there is an r.squared column in glanced is sufficient, or should I add a new column of information to glance, such as class(object)[1]? Or maybe the call should be saved as an attribute of tidy.mipo(object)? Here's what I do now: https://github.com/vincentarelbundock/mice/blob/tidymipo/R/pool.r.squared.R#L73
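
In rough form, the current check amounts to something like this (a simplification, not the exact code at the linked line):

# Simplified: treat the presence of an r.squared column in the glanced
# component as evidence that the original fits were lm-like.
if (!"r.squared" %in% names(object$glanced)) {
  stop("r.squared can only be pooled for models that report it (e.g. lm)")
}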

@stefvanbuuren (Member)

  1. The standard error after the Fisher transform is 1/sqrt(N - 3), and that's what I've used (sketched below). The df.residual parameter depends on model complexity. I am not aware of any work that compares the two possibilities. For the moment, I would stick to the "official" Fisher formulation, and thus use the "3".

  2. Although some may disagree with me, pooling R^2 is a general procedure that is not tied to the linear model, so I think there is no need to know the model. For now, I would just pool any r.squared and r.squared.adjusted values, irrespective of where they come from. That said, it would be interesting to study this a little more deeply.
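
To make the role of the "3" concrete, here is a simplified sketch of Fisher-z pooling for R^2 (an illustration of the idea, not mice's pool.r.squared() code):

# r2: per-imputation R^2 values; n: sample size
pool_r2 <- function(r2, n) {
  m <- length(r2)
  z <- atanh(sqrt(r2))            # Fisher transform of r = sqrt(R^2)
  ubar <- 1 / (n - 3)             # within variance: the "official" 1/(N - 3)
  b <- var(z)                     # between-imputation variance
  t <- ubar + (1 + 1 / m) * b     # Rubin's total variance
  c(estimate = tanh(mean(z))^2, total.variance = t)
}

pool_r2(c(0.41, 0.38, 0.44), n = 100)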

@stefvanbuuren (Member)

For those who are interested in the problem of pooling R^2, see also Van Ginkel, J.R. (2020). Standardized regression coefficients and newly proposed estimators for R^2 in multiply imputed data. Psychometrika. https://doi.org/10.1007/s11336-020-09696-4

@vincentarelbundock (Contributor, Author)

Thanks for the reference. Looks interesting.

I believe that commit 59f1ad4 wraps things up for this PR.

Of course, I'm happy to change stuff if you find anything you dislike during code review.

@stefvanbuuren (Member)

Vincent, thanks a lot. Almost there. I found three things:

  1. DESCRIPTION needs a Depends: generics.

  2. An import is needed:

  pool: no visible global function definition for ‘residuals’
  summary.mira: no visible global function definition for ‘residuals’
  Undefined global functions or variables:
    residuals
  Consider adding
    importFrom("stats", "residuals")
  to your NAMESPACE file.

  3. The weird test errors still occur with test-pool.r.squared.R and test-tidiers.R. Start a fresh session, then library(mice) and library(testthat), and run the test code. This produces errors, but devtools::check() does not find them. Very curious.

Could you take a look at these?
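
For item 2, the fix is a one-line import declaration; a sketch, assuming NAMESPACE is generated by roxygen2 (if it is edited by hand, add the importFrom line directly):

# Above the definitions of pool() and summary.mira(), add the tag:
#' @importFrom stats residuals
# then run devtools::document() to regenerate NAMESPACE with
#   importFrom("stats", "residuals")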

@vincentarelbundock (Contributor, Author)

Forgetting to run devtools::check() was a rookie mistake!

I fixed problems 1 and 2 and force-pushed to keep the history a bit cleaner.

W.r.t. problem 3, the issue seems to be that the imputed data produced by mice::mice() in an interactive session differ from the imputed data produced under devtools::test().

Here's the smallest example I can produce. First, run this interactively from a fresh session to look at the seeds:

imp <- mice::mice(mice::nhanes, maxit = 2, m = 2, seed = 1, print = FALSE)
imp$lastSeedValue[1:5]
#> [1]      10403        141 -705880619 1904110533 1951823516

Then, save this in a file called mice/tests/testthat/test-seed.R:

context("seed issue")

imp <- mice::mice(mice::nhanes, maxit = 2, m = 2, seed = 1, print = FALSE)

test_that('interactive and devtools::test seeds differ',{
    interactive <- c(10403L, 141L, -705880619L, 1904110533L, 1951823516L)
    expect_equal(imp$lastSeedValue[1:5], interactive)
})

Finally, open a fresh session of R, and run devtools::test(). On my machine, the result is:

test-seed.R:7: failure: interactive and devtools::test seeds differ
imp$lastSeedValue[1:5] not equal to `interactive`.
2/5 mismatches (average diff: 5003)
[1] 403 - 10403 == -10000
[2] 135 -   141 ==     -6

This shows that the first 2 elements of imp$lastSeedValue differ depending on whether we run interactively in a fresh session or under devtools::test().

Note that this problem persists if I check out and install 9184f07, which is your last commit before I made any changes.

I tried to find the problem, but I don't understand the internals of mice well enough to figure it out. For instance, in the mice() function, we assign .Random.seed to lastSeedValue, but .Random.seed is not assigned anywhere in that function.

https://github.com/stefvanbuuren/mice/blob/master/R/mice.R#L397
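
The linked line amounts to the following (paraphrased; variable name illustrative). Note that .Random.seed is R's global RNG state, read from the global environment:

# mice() stores the session's global RNG state in the returned mids object:
midsobj$lastSeedValue <- .Random.seed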

  • Why do only the first two elements of lastSeedValue differ?
  • Could there be seed "bleeding" between calls when several runs of mice are conducted in a test suite?
  • Why does the parlmice test fail when I run devtools::test() twice in a single session?

@stefvanbuuren (Member)

  1. I was able to reproduce your findings. imp$lastSeedValue is used in mice.mids() to continue iteration.

  2. The following script suggests that devtools::test() and interactive R use different random generators:

context("seed issue")

set.seed(1L)
test_that('interactive and devtools::test seeds differ (2)',{
  interactive <- c(10403L, 624L, -169270483L, -442010614L, -603558397L)
  expect_equal(.Random.seed[1:5], interactive)
})

The test fails. See also https://stackoverflow.com/questions/56104176/devtoolstest-rngversion-and-sample-yields-different-results. I searched the devtools issues for "set.seed" and "RNG", but didn't find it reported. Do you think it's worth opening an issue?

  3. I do not (yet) know where the parlmice problem stems from.

  4. Your code looks fine; I will merge.

@stefvanbuuren merged commit 2618de9 into amices:master on Jun 4, 2020
@stefvanbuuren (Member)

I have now found the likely culprit. In test-pool.R I use

# set the random generator to V3.5.0 to ensure that this test 
# passes in V3.6.0 and later
# see mail Kurt Hornik, dated 06mar19
# FIXME: consider using the new generator once V3.6.0 is out, 
# at the expense of breaking reproducibility of the examples in 
# https://stefvanbuuren.name/fimd/
suppressWarnings(RNGversion("3.5.0"))

I expect updating this will solve it; it is not a devtools problem.
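
A quick way to see why this setting bleeds across tests (illustrative):

# RNGversion("3.5.0") switches sample.kind to the pre-3.6.0 "Rounding"
# behaviour, and the change persists for the rest of the session, so it
# leaks into every test that runs afterwards.
RNGkind()                                # R >= 3.6.0 default ends in "Rejection"
suppressWarnings(RNGversion("3.5.0"))
RNGkind()                                # sample.kind is now "Rounding"
set.seed(1); sample(10, 3)               # differs from the >= 3.6.0 default
RNGversion(as.character(getRversion()))  # restore the current defaults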

@vincentarelbundock (Contributor, Author)

RNGs still feel a bit like black magic to me, so I'm glad you were able to track down the problem.

Thanks for being so efficient and responsive. This PR was a breeze. And thanks for your work on mice more generally!
