misc additions to discussion

bcjaeger · Nov 3, 2020
commit 759d2f5 · 1 parent 9eccde5
Showing 1 changed file with 17 additions and 5 deletions.
diff --git a/doc_arXiv/doc_arXiv.Rmd b/doc_arXiv/doc_arXiv.Rmd
@@ -85,9 +85,13 @@ In this section, we describe the INTERMACS Registry in Section \ref{subsec:inter
 ## INTERMACS Registry
 \label{subsec:intermacs}
 
-The INTERMACS data is publicly available on biolincc at https://biolincc.nhlbi.nih.gov/studies/intermacs/. INTERMACS is a North American observational registry for patients receiving MCS devices that began as a partnership between the National Heart, Lung, and Blood Institute, US Food and Drug Administration, the Centers for Medicaid and Medicare Services, industry, and individual hospitals with the mission of improving MCS outcomes. In 2018, INTERMACS became an official Society of Thoracic Surgeons database.
+The INTERMACS data is publicly available on biolincc at https://biolincc.nhlbi.nih.gov/studies/intermacs/. INTERMACS is a North American observational registry for patients receiving MCS devices that began as a partnership between the National Heart, Lung, and Blood Institute, US Food and Drug Administration, the Centers for Medicaid and Medicare Services, industry, and individual hospitals with the mission of improving MCS outcomes. In 2018, INTERMACS became an official Society of Thoracic Surgeons database. 
 
-The current analysis was conducted using publicly available data provided by the National Heart, Lung, and Blood Institute. We included a contemporary cohort of `r table_value(nrow(im))` patients who received continuous flow LVAD from `r min(im$im_impl_yr)`-`r max(im$im_impl_yr)`. Patient follow-up begins after implantation of a durable, long term MCS device and continues while the device is in place. Registry endpoints include death on a device, heart transplantation, or cessation of support (for recovery and non-recovery reasons). INTERMACS collects pre-implant patient characteristics, medical status, laboratory values, and many other variables. Data is collected at regularly scheduled follow-up as well as during adverse events such as re-hospitalization. This is a secondary analysis of de-identified data obtained from the National Heart Lung and Blood Institute. Primary data collection is approved through University of Alabama Institutional Review Board and at individual sites. 
+The current analysis was conducted using publicly available data provided by the National Heart, Lung, and Blood Institute. We included a contemporary cohort of `r table_value(nrow(im))` patients who received continuous flow LVAD from `r min(im$im_impl_yr)`-`r max(im$im_impl_yr)`. This is a secondary analysis of de-identified data obtained from the National Heart Lung and Blood Institute. Primary data collection is approved through University of Alabama Institutional Review Board and at individual sites. 
+
+## Outcomes and Predictors
+
+Patient follow-up begins after implantation of a durable, long-term MCS device and continues while the device is in place. Registry endpoints include death on a device, heart transplantation, or cessation of support (for recovery and non-recovery reasons). Mortality and transplant after MCS were the primary outcomes for the current study. As there were only `r sum(im$pt_outcome_cess)` cessation-of-support events, we did not analyze this outcome. INTERMACS collects pre-implant patient characteristics, medical status, laboratory values, and many other variables. Data is collected at regularly scheduled follow-up as well as during adverse events such as re-hospitalization. For the current analysis, all `r ncol(im) - 4` pre-implant variables were considered as potential predictors.
 
 ## Statistical Inference and Learning with Missing data 
 \label{subsec:inference_and_learning}
@@ -318,9 +322,17 @@ Adjusting for the amount of additional missing data amputed and the outcome vari
 
 # Discussion
 
-In this article, we leveraged data from the INTERMACS registry to evaluate how the use of different imputation strategies prior to fitting a risk prediction model would impact the external prognostic accuracy of the model. External prognostic accuracy was measured at `r times` months after receiving MCS to focus on short term risk prediction, and the primary measure of accuracy was the scaled Brier score. We evaluated the performance of 12 imputation strategies in a broad range of settings by varying the type of modeling algorithm, the outcome variable, and the amount of additional missing data amputed prior to performing imputation. Our resampling experiment indicated that conducting multiple imputation has a high likelihood of increasing the downstream scaled Brier score and C-index of risk prediction models compared with imputation to the mean. Additionally, multiple imputation with random forests emerged as the imputation strategy that maximized the probability of developing a more prognostic model compared with imputation to the mean. 
+In this article, we leveraged INTERMACS registry data to evaluate how the use of different imputation strategies prior to fitting a risk prediction model would impact the external prognostic accuracy of the model. External prognostic accuracy was measured at `r times` months after receiving MCS, and the primary measure of accuracy was the scaled Brier score. We evaluated the performance of 12 imputation strategies in a broad range of settings by varying (1) the amount of additional missing data amputed prior to performing imputation, (2) the type of risk prediction model applied after imputation, and (3) the outcome variable for the risk prediction model. Our resampling experiment indicated that conducting multiple imputation has a high likelihood of increasing the downstream scaled Brier score and C-index of risk prediction models compared with imputation to the mean. Additionally, multiple imputation with random forests emerged as the imputation strategy that maximized the probability of developing a more prognostic model compared with imputation to the mean. 
+
+In previous studies involving the INTERMACS data registry, imputation to the mean has been applied prior to developing a mortality risk prediction model \cite{hsich2012should, cotts2014predictors, eckman2011survival, kirklin2017eighth, kormos2019society}. An interesting recent study indicates that imputation to the mean can provide an asymptotically consistent prediction model, given the prediction model is flexible and non-linear. However, theoretical results for finite samples have not yet been established. Our results provide relevant data for the finite sample case, suggesting that using imputation strategies considered in the current study instead of imputation to the mean can improve the prognostic accuracy of downstream models, particularly if multiple imputation is applied.
+
+Previous research has also established evidence in favor of applying multiple imputation to improve the prognostic value of risk prediction models. For example, Hassan and Atiya demonstrated superior downstream prediction using an ensemble multiple imputation method on synthetic data with continuous outcomes \cite{hassan2007regression}. Similarly, Nanni et al. demonstrated superior performance in downstream prediction when missing values were imputed using their proposed ensemble multiple imputation method \cite{nanni2012classifier}. Notably, the authors artificially induced missing values in these studies and the largest real dataset that was evaluated contained fewer than 700 observations. An article by Jerez et al. evaluated missing data strategies based on the downstream task of fitting a neural network and predicting early breast cancer relapse \cite{jerez2010missing}. The authors found that KNN imputation led to risk prediction models with the highest discrimination and lowest calibration error. Results from the current study are consistent with these previous findings but also extend their results by providing evidence from a larger source of data (\ie INTERMACS) and dealing with `real-world' missing values. 
+
+Others have previously evaluated imputation techniques based on the accuracy with which these techniques impute missing values in the training data \cite{tutz2015improved, little2013joys, steele2018machine}. While it is intuitive to hypothesize that more accurate imputation will provide more prognostic downstream models, our results do not support this supposition. For example, when an additional 30\% of missing data were amputed, none of the missing data strategies we implemented obtained higher imputation accuracy than imputation to the mean. However, using \emph{any} of the multiple imputation strategies we considered instead of imputation to the mean increased prognostic accuracy of downstream models when an additional 30\% of missing data were amputed. This result is likely explained by the bias-variance tradeoff. In particular, single imputation techniques may lead to prediction models with lower bias but higher variance than multiple imputation techniques.  
+
+\paragraph{Strengths and limitations} The current analysis has a number of strengths. We leveraged the INTERMACS data registry, comprising one of the largest cohorts of patients who received MCS. We applied a well-known resampling method to internally validate modeling algorithms for risk prediction. We also made all source code for our analysis available in a public repository (see the first author's GitHub). Last, the approach presented in this paper provides a general framework that can be applied to risk prediction models in other longitudinal studies. The current analysis should also be interpreted in the context of known limitations. We considered a small subset of existing strategies to impute missing data, and other strategies may have provided stronger improvements compared with imputation to the mean. Also, we were not able to use only the training data to impute missing values in the testing data. Although the \texttt{miceRanger} package allows imputation of new data using existing models, few software packages for imputation allow users to implement multiple imputation with this protocol. 
 
-In previous studies involving the INTERMACS data registry, imputation to the mean has been applied prior to developing a mortality risk prediction model \cite{hsich2012should, cotts2014predictors, eckman2011survival, kirklin2017eighth, kormos2019society}. The current analysis has shown that the prognostic accuracy of these models can be increased by implementing a different strategy to impute missing values, particularly if multiple imputation is applied. However, few studies have investigated best practices on the use of multiple imputation for prediction models, and few software packages allow users to implement multiple imputation procedures that adhere to best practices for prediction model development (\eg using only the training data to impute missing values). Further development of open source software that allows investigators to safely and succinctly implement multiple imputation for prediction models (\eg the \texttt{miceRanger} package) may increase the presence of multiple imputation in published risk prediction models.
+\paragraph{Conclusion} Selecting an optimal strategy to impute missing values can impact the prognostic accuracy of downstream risk prediction models. In the current analysis, conducting multiple imputation using random forests emerged as an optimal strategy to impute missing values in the INTERMACS data. This investigation can directly inform future analyses of INTERMACS data and provide evidence quantifying the benefit of imputing missing data with sound methodology. 
 
 
 
@@ -367,7 +379,7 @@ tbl_missingness %>%
 
 \clearpage
 
-```{r, results = 'asis'}
+```{r tbl_impute_accuracy, results = 'asis'}
 
 tbl_impute_accuracy %>% 
   select(md_method:numeric_mi) %>%