Missing Data Analysis Using Multiple Imputation
Getting to the Heart of the Matter
Missing data are a pervasive problem in health investigations. We describe some background of missing data analysis and criticize ad hoc methods that are prone to serious problems. We then focus on multiple imputation, in which missing cases are first filled in by several sets of plausible values to create multiple completed datasets, then standard complete-data procedures are applied to each completed dataset, and finally the multiple sets of results are combined to yield a single inference. We introduce the basic concepts and general methodology and provide some guidance for application. For illustration, we use a study assessing the effect of cardiovascular diseases on hospice discussion for late stage lung cancer patients.
The empirical basis of health services and outcomes research largely rests on statistical analysis of data collected in studies. However, it is typical that not all planned observations are made. The reasons for missing data are numerous. Subjects may have missed a visit for a practical or administrative reason, or data may not have been collected at a particular time because of equipment failure. Subjects may drop out from studies because side effects associated with the treatment prohibited them from continued study participation, or they may not report their health outcome because they are too sick. Missing data can also arise from study designs. For example, different survey forms may be used in one study, and therefore some variables in one survey are collected for some but not all patients.
Missing data can be a serious impediment for data analysis. For example, Huskamp et al1 investigated patterns of hospice discussion with providers by patients with late-stage lung cancer. They used data collected from a multisite cohort study of care for patients with lung or colorectal cancer by the Cancer Care Outcomes Research and Surveillance (CanCORS) Consortium.2 As is typical in large health or social studies, there is a substantial amount of missing data in the CanCORS database, and the missing values display no systematic pattern. In our illustrative example, in which the hospice study data are used and the analytic goal is to use a logistic regression to assess the effect of the patients’ cardiovascular disease status on their tendency to discuss hospice, the fractions of missing observations range from 0.04% to 19.48% across the variables, including both the outcome and predictors. Simply removing the patients with missing values from the analysis would result in a loss of around 30% of the sample, raising serious concerns about the validity of the results.
In this report, we review the multiple imputation3 approach to missing data problems in the context of cross-sectional data analysis. The next section introduces some background, then multiple imputation is discussed, followed by the hospice study example to illustrate the methods, and the final section concludes with a discussion.
Missing Data Mechanism
Table 1 shows a few lines of the dataset used in the hospice study example. Typically, a data set in analytic form can be characterized as a rectangular matrix (row=subjects, column=variables), and the missing data are the elements that we do not observe in this matrix, marked by question marks in Table 1.
The missingness pattern of a dataset can be represented by a missing indicator matrix of 1’s and 0’s shaped like the data matrix, with 1’s for missing values and 0’s for observed ones. We think of missingness as consequences of a random process that can be characterized by missingness models. For example, a model relating missingness of myocardial infarction to other variables in the dataset may suggest that older patients with stroke are more likely to have nonresponse. Three broad types of missingness mechanisms,4 moving from the simplest to the most general, are:
Missing completely at random (MCAR): A variable is MCAR if the probability of missingness is independent of any characteristics of the subjects. For example, each survey respondent decides whether to answer the “age” question by rolling a die and refusing to answer it if a “1” appears (ie, with a probability of 1/6). However, most missingness is not completely random. In the hospice study, for example, older patients are more likely than younger ones to have nonresponse on either income or insurance questions.
Missing at random (MAR): A more general assumption, MAR, is that the probability a variable is missing depends only on observed variables. For instance, older patients might be more likely to miss “insurance” than younger patients, and then “insurance” is MAR if the study has collected information on age for all patients in the survey.
Not missing at random (NMAR): Missingness is no longer “at random” if its probability depends on variables that are incomplete. A common example is that people with higher income are less likely to reveal it, that is, the nonresponse probability for the income variable depends on values that can be missing.
Understanding which class the missing data mechanism falls into is key to making correct statistical inferences. MAR can never be proved or falsified using data alone, because the alternative NMAR assumption concerns values that are not available to the researchers in the observed data. It is possible, however, to test whether data are MCAR in many situations: if meaningful differences exist between those with and without missing data for some variables, this provides evidence against MCAR.
Under MAR (including MCAR as a special case), we can ignore the missingness models and focus on the missing-data models, which describe the predictive relationship between the incomplete variables and observed ones (eg, a model relating missing myocardial infarction to other variables in the dataset may indicate that older patients are more likely to have myocardial infarction). Under NMAR, however, missingness models generally must be specified to obtain the correct inferences,3,5 although using MAR models including more variables may achieve close results.6–8
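The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the age and income distributions, the response probabilities, and the function names are assumptions chosen for the example, not quantities from the hospice study. It generates complete data, deletes income values under each mechanism, and then computes the kind of group comparison described above for checking MCAR.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Simulate age, income, and a 0/1 missing indicator for income."""
    rng = random.Random(seed)
    age, income, missing = [], [], []
    for _ in range(n):
        a = rng.uniform(40, 90)
        inc = 80 - 0.5 * a + rng.gauss(0, 10)  # hypothetical: income declines with age
        if mechanism == "MCAR":
            p = 1 / 6                      # the die roll: unrelated to any variable
        elif mechanism == "MAR":
            p = 0.4 if a > 65 else 0.1     # depends only on age, which is always observed
        elif mechanism == "NMAR":
            p = 0.4 if inc > 50 else 0.1   # depends on the income value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        age.append(a)
        income.append(inc)
        missing.append(1 if rng.random() < p else 0)
    return age, income, missing


def age_gap(age, missing):
    """Mean age of subjects with missing income minus mean age of the rest."""
    with_missing = [a for a, m in zip(age, missing) if m == 1]
    without_missing = [a for a, m in zip(age, missing) if m == 0]
    return statistics.fmean(with_missing) - statistics.fmean(without_missing)


def observed_income_bias(income, missing):
    """Mean of the observed incomes minus the mean of all (simulated) incomes."""
    observed = [v for v, m in zip(income, missing) if m == 0]
    return statistics.fmean(observed) - statistics.fmean(income)


for mechanism in ("MCAR", "MAR", "NMAR"):
    age, income, missing = simulate(mechanism)
    print(mechanism,
          round(age_gap(age, missing), 2),
          round(observed_income_bias(income, missing), 2))
```

In this setup the age gap is near zero only under MCAR. A nonzero gap is evidence against MCAR but does not by itself separate MAR from NMAR, because income and age are correlated in the simulated data.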
Ad Hoc Missing Data Methods
A common missing data approach is complete-case analysis (CC), which uses only subjects who have all variables observed and is also the default option in many statistical software packages. When data are MCAR, CC analysis results are unbiased. When data are MAR but not MCAR, it is permissible to exclude the missing observations, provided that a regression model controls for all the variables that affect the probability of missingness.9 However, CC analysis generally has major deficiencies.5,10 The results can be biased when data are not MCAR. In addition, the reduction of statistical power by discarding cases is a major drawback. For example, suppose data are MCAR across 20 variables and the missingness fraction is 5% for each variable. Using CC analysis will lose close to two thirds of the subjects because the fully observed subjects account for only (1 − 0.05)^20 ≈ 0.36, or about 36%, of the original sample.
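The arithmetic behind this example can be checked directly. The short sketch below assumes, as the example implicitly does, that missingness is independent across the variables; the function name is ours.

```python
def complete_case_fraction(miss_rate, n_vars):
    """Expected share of subjects with every variable observed,
    assuming missingness is independent across variables."""
    return (1 - miss_rate) ** n_vars


fraction = complete_case_fraction(0.05, 20)
print(f"retained: {fraction:.1%}, discarded: {1 - fraction:.1%}")
```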
Ad Hoc Imputation
Imputation methods fill in missing values to maintain the full sample so that standard software can be easily used to analyze completed data. In addition, the researcher using the imputed data can concentrate on substantive questions of interest rather than incomplete-data problems.11
However, many ad hoc imputation methods (eg, mean imputation and treating missing data as a separate category) are based on missing data models with implausible assumptions. Furthermore, these methods impute the missing data only once and then proceed to the completed data analysis. These single imputation strategies generally underestimate the standard errors of estimates because choosing a single imputation pretends that we know the unobserved value with certainty, when actually it is unknown but estimated by the imputation method.
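The understatement of standard errors is easy to reproduce. The following sketch uses simulated data with assumed parameters (a normal variable with 40% of values deleted completely at random). It compares the standard error of the mean computed from the observed cases with the naive standard error computed after mean imputation, which treats the filled-in values as if they had been observed.

```python
import random
import statistics


def mean_imputation_demo(n=1000, miss_rate=0.4, seed=1):
    """Return (SE from observed cases, naive SE after mean imputation)."""
    rng = random.Random(seed)
    full = [rng.gauss(50, 10) for _ in range(n)]
    # Delete values completely at random.
    observed = [v for v in full if rng.random() >= miss_rate]
    fill = statistics.fmean(observed)
    # Single imputation: every missing value is replaced by the observed mean.
    completed = observed + [fill] * (n - len(observed))
    se_observed = statistics.stdev(observed) / len(observed) ** 0.5
    se_naive = statistics.stdev(completed) / len(completed) ** 0.5
    return se_observed, se_naive


se_observed, se_naive = mean_imputation_demo()
print(se_observed, se_naive)
```

The naive standard error is smaller for two reasons: the imputed values add no spread, so the sample standard deviation shrinks, and the sample size used in the denominator is inflated.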
Some Principled Approaches
Nonresponse weighting12 is a principled approach for making the subjects included in the analysis representative of the original sample. For example, suppose that 100% of whites and 50% of blacks responded in a survey. If there are large differences between whites and blacks in the variable of interest, then the sample mean from the observed cases would be biased relative to the average of the complete data. Assuming MCAR for blacks, weighting their observations by 2 (=1/0.5), that is, letting each respondent represent 2 cases from the original sample, and then calculating the weighted average of the observed cases would yield a more accurate estimate. In more general scenarios, the weights are the inverse of the predicted probabilities of response estimated from the missingness models of incomplete variables.
Weighting might be best suited for unit nonresponse in a survey (ie, cases sampled for the survey but not participating in an interview, such as noncontacts and refusers). On the other hand, by including only subjects with complete data, it ignores partial information from subjects with incomplete data and can thus lead to reduction of efficiency. In addition, weighting becomes considerably less tractable with multiple missing variables when there is no regular pattern for missing data.13,14 Furthermore, because weights are estimated from the proposed models, this extra level of prediction will introduce more uncertainty to the inference. Sometimes extreme estimates of weights (if the predicted probabilities are close to 0 or 1) can lead to erratic variance estimates.
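A toy version of this calculation is sketched below. The group sizes and outcome values are made up for illustration: 6 subjects in a group where everyone responds (all with value 10) and 4 subjects in a group where half respond (all with value 20), so the complete-data mean is 14.

```python
def weighted_mean(values, response_probs):
    """Inverse-probability-weighted mean of the respondents' values."""
    weights = [1 / p for p in response_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical respondents: (value, response probability of the respondent's group).
respondents = [(10, 1.0)] * 6 + [(20, 0.5)] * 2
values = [v for v, _ in respondents]
probs = [p for _, p in respondents]

unweighted = sum(values) / len(values)   # 12.5, pulled toward the fully responding group
weighted = weighted_mean(values, probs)  # 14.0, matches the complete-data mean
print(unweighted, weighted)
```

In practice the response probabilities are not known and would be estimated from a missingness model, which is the source of the extra uncertainty that comes with weighting.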
Another principled approach is maximum likelihood (ML), which maximizes the likelihood function of the incomplete data, with the missing data values removed from the complete-data likelihood by a process of summation or integration (Appendix). The resulting parameter estimates are efficient because all observed data are used. Principles and examples for applying ML to incomplete-data problems can be found in Reference 5.5
Incomplete-data likelihood functions often have a complicated form, and special computational techniques such as the EM algorithm15 may be needed to maximize them. Computational aspects of ML with missing data are reviewed by Schafer.16 Typically, special software must be developed for a particular problem because ML is usually problem-specific. Thus, the technical difficulties involved in constructing a likelihood model and carrying out the computation make ML less appealing for most practitioners.
Under MAR, the multiple imputation3 approach seeks to retain the advantages of ML estimates while also allowing the uncertainty caused by imputation, which is ignored in single imputation, to be incorporated into the completed-data analysis. It involves creating more than 1 set of replacements for the missing values based on plausible models for data, therefore generating multiple completed datasets for analysis (Figure). The statistical reasoning behind multiple imputation is that the observed-data likelihood can be approximated by the average of the completed-data likelihood over unknown missing values (Appendix). That is, multiple imputation analysis that combines the likelihood-based analysis from each completed dataset is approximately equivalent to the analysis based on the observed-data likelihood, whereas the imputation uncertainty is reflected by the variation across the multiple completed datasets.
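The averaging argument can be written compactly. The display below is a sketch in our own notation (θ for the parameters, Y_obs and Y_mis for the observed and missing data, m for the number of imputations), stated for the posterior distribution of θ, which is proportional to the likelihood under a flat prior. It is the standard identity behind multiple imputation and is not copied from the Appendix.

```latex
p(\theta \mid Y_{\mathrm{obs}})
  = \int p(\theta \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}})\,
         p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})\, dY_{\mathrm{mis}}
  \approx \frac{1}{m} \sum_{d=1}^{m}
         p\bigl(\theta \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}^{(d)}\bigr),
\qquad Y_{\mathrm{mis}}^{(d)} \sim p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}).
```

Each term in the sum corresponds to one completed-data analysis, and the spread of the terms across d = 1, ..., m carries the imputation uncertainty.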
The analysis of multiply imputed data proceeds as follows:
1. Analyze each completed dataset separately using a suitable software package designed for complete data (eg, SAS, STATA, or R).
2. Extract the point estimate and standard error from each analysis.
3. Combine the multiple sets of point estimates and standard errors to obtain a single point estimate, standard error, and the associated confidence interval or probability value.
The combining rules in step 3 contain some formulas for calculating the average of the estimates across multiple imputations and the variances of the estimates, both within and between imputations (Appendix). They have been incorporated into imputation packages (see the Software section below) for automatic calculations.
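For readers who want to see the combining rules spelled out, the sketch below implements the standard rules from Rubin (Reference 3) for a scalar estimate: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and return format are ours; imputation packages perform the same calculation automatically.

```python
import math
import statistics


def pool_estimates(estimates, std_errors):
    """Combine m completed-data estimates with Rubin's rules.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least 2 estimates and matching standard errors")
    q_bar = statistics.fmean(estimates)                        # pooled point estimate
    within = statistics.fmean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = statistics.variance(estimates)                   # between-imputation variance
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = math.inf  # no imputation uncertainty: the reference distribution is normal
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, math.sqrt(total), df


# Hypothetical log odds ratios and standard errors from 5 completed datasets.
estimate, se, df = pool_estimates([0.42, 0.38, 0.47, 0.40, 0.44],
                                  [0.15, 0.16, 0.15, 0.14, 0.15])
print(estimate, se, df)
```

A confidence interval then uses a t reference distribution with the returned degrees of freedom, centered at the pooled estimate.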
Plausible imputation should give reasonable predictions for the missing data, and the variability among them must reflect an appropriate degree of uncertainty. Rubin3 recommends that imputations be created through bayesian arguments: Specify a parametric model for the complete data under MAR, assume a prior distribution for the unknown model parameters, and simulate multiple independent draws from the conditional distribution of missing values given observed data by Bayes theorem. A simple example for univariate missing outcome is given in the Appendix.
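The Bayesian recipe can be illustrated for the simplest case of a single normally distributed variable with some values missing. The sketch below is our own minimal version and is not taken from the Appendix. It assumes a normal model, MCAR missingness, and the usual noninformative prior p(μ, σ²) ∝ 1/σ². For each imputation it first draws the parameters from their posterior and only then draws the missing values, so that parameter uncertainty is carried into the imputations.

```python
import random
import statistics


def impute_normal(y, m=5, seed=0):
    """Create m completed copies of y (None marks a missing value)."""
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n = len(observed)
    y_bar = statistics.fmean(observed)
    s2 = statistics.variance(observed)
    completed = []
    for _ in range(m):
        # sigma^2 | y_obs: (n - 1) * s^2 divided by a chi-square draw with n - 1 df
        # (a chi-square with k df is a gamma with shape k/2 and scale 2).
        chi2 = rng.gammavariate((n - 1) / 2, 2)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y_obs: normal around the observed mean.
        mu = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        sigma = sigma2 ** 0.5
        # Missing values | mu, sigma^2: independent normal draws.
        completed.append([v if v is not None else rng.gauss(mu, sigma) for v in y])
    return completed


y = [4.1, None, 5.3, 6.0, None, 4.8, 5.5]
for dataset in impute_normal(y):
    print([round(v, 2) for v in dataset])
```

Plugging in the sample mean and variance directly, without the two parameter draws, would produce imputations that vary too little across datasets.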
Various imputation models have been developed within more general and complicated contexts. See van Buuren17 for a summary and references. In general, the strategy of building imputation models falls into 2 categories:
Joint modeling. The joint modeling approach partitions the observations into groups of identical missing data patterns and imputes the missing entries with each pattern according to a joint model for the variables that is common to all observations. Some classic examples include multivariate normal models for continuous variables, log-linear models for categorical variables, general location models for a mixture of continuous and categorical variables,16 and mixed-effects models for repeated measurements or multilevel data.18,19 These methods start by specifying a parametric multivariate density for the data given model parameters. Under an appropriate prior distribution for the parameters, it is possible to derive the appropriate submodel for each missing data pattern, from which imputations are drawn. The joint modeling approach is theoretically sound but may lack the flexibility needed to represent complex data structures arising in many studies. For example, the CanCORS data consist of a large number of variables having a variety of distributional forms, subject to certain logical or consistency bounds imposed by study questionnaires, and displaying unsystematic missingness patterns. In such a case, the joint modeling strategy is difficult to implement because the typical specifications of multivariate distributions are not sufficiently flexible to accommodate these features.
Sequential regression multiple imputation (SRMI)20,21 (also referred to as the multiple imputation by chained equations). In SRMI, multivariate data are characterized by separate conditional models for each incomplete variable. That is, the imputation model is specified separately for each variable, with other variables as predictors. At each step of the SRMI algorithm, imputations are generated for the missing values of 1 variable; these imputed values are then used in the imputation of the next variable, and this process repeats until it reaches convergence. Compared with the joint modeling approach, an appealing feature of SRMI is that it is relatively easy to accommodate complex data features in univariate regression models. Constructing these regression models can follow common guidelines of regression modeling applied to the data at hand. For continuous variables, the model may involve a linear regression model or its robust extensions.22 Dichotomous variables may be modeled by logistic regression and categorical variables with more than 2 categories by polytomous models. Poisson models can be used for incomplete count data and 2-part models for a variable with a mixture of point mass and continuous values. Detailed information can be found in the manuals for the related software.
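The cycling structure of SRMI can be sketched for 2 continuous variables that are each partly missing. This is a deliberately simplified illustration and not the algorithm implemented in IVEware, MICE, or ICE: it uses ordinary least squares with normal residual noise, skips the draw of regression parameters from their posterior that a proper implementation would include, and produces a single completed dataset (repeating it with different seeds would give multiple imputations). All function names are ours.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept, slope, and residual SD for y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma


def impute_one(target, target_cur, predictor_cur, rng):
    """Re-impute the missing entries of one variable given the other's current values."""
    rows = [i for i, v in enumerate(target) if v is not None]
    a, b, s = fit_line([predictor_cur[i] for i in rows], [target[i] for i in rows])
    for i, v in enumerate(target):
        if v is None:
            target_cur[i] = a + b * predictor_cur[i] + rng.gauss(0, s)


def chained_imputation(x, y, n_iter=5, seed=0):
    """Cycle through the conditional models for x and y; None marks missing."""
    rng = random.Random(seed)
    x_obs = [v for v in x if v is not None]
    y_obs = [v for v in y if v is not None]
    # Starting values: fill each gap with a randomly chosen observed value.
    x_cur = [v if v is not None else rng.choice(x_obs) for v in x]
    y_cur = [v if v is not None else rng.choice(y_obs) for v in y]
    for _ in range(n_iter):
        impute_one(x, x_cur, y_cur, rng)  # model for x given y
        impute_one(y, y_cur, x_cur, rng)  # model for y given x, using the new x values
    return x_cur, y_cur
```

For categorical variables the linear model in `fit_line` would be replaced by a logistic or polytomous model, which is the flexibility the SRMI approach is valued for.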
Some popular imputation software includes:
SAS: PROC MI uses regression methods and propensity scores for imputation. PROC MIANALYZE combines estimates output from various complete-data procedures.
S-plus: The missing data library supports different models for multivariate normal (“impGauss”), categorical variables (“impLoglin”), and the conditional gaussian (“impCgm”) for imputation involving both continuous and categorical variables.
R: It supports libraries such as “norm,” “cat,” “mix,” and “pan” for imputing data under multivariate normal models, log-linear models, general location models, and linear mixed models, respectively. In addition, libraries including “mi” and “Hmisc” impute data in more complex scenarios and provide tools for diagnostics.
IVEware: Imputation and Variance Estimation software for SRMI, callable by SAS (http://www.isr.umich.edu/src/smp/ive).
MICE: Multiple Imputation by Chained Equations, library available in both S-plus and R (http://web.inter.nl.net/users/S.van.Buuren/mi/html/mice.htm).
ICE: SRMI library available in STATA.
Descriptions of other imputation software and more comprehensive reviews appear in References 23 through 25.23–25
Checking of imputation models is important because it can identify model defects and facilitate model improvement. As in complete-data analysis, one possible strategy is to check regression modeling assumptions such as normality and homoscedasticity of the regression residuals on the incomplete data. Graphical diagnostics can be used26,27 (see also R library “mi”). More advanced bayesian strategies assess the similarity between observed data and their replicates drawn from the imputation model.28 Sensitivity analysis under different imputation models is also helpful.
This section summarizes some of the key steps involved in a typical multiple imputation project for practitioners.
Understand the analytic objective and identify the data structure and study design.
Make appropriate assumptions for missing data mechanism.
Identify variables to be included in imputation. The general strategy is to include at least all variables involved in the planned analysis. For example, when imputing missing predictors, the outcome variables should be included in imputation to retain the association between the outcome and predictors. In addition, variables not used in the analysis yet having strong correlation with incomplete variables might be included.
Construct the imputation model. It is important to seek a balance between sophistication and feasibility of models. For most empirical analyses, we recommend using existing models in the literature or those provided by available software.
Use the appropriate imputation package for implementation.
Carry out imputation diagnostics and sensitivity analysis.
Postimputation data processing. For example, imputed values might be outside the range of observed data, making rounding and truncation necessary.29,30
Combine completed-data estimates from multiple datasets and report the results.
Flag the imputations in the completed data for better reference.
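Two of the later steps, postimputation processing and flagging, are simple enough to sketch. The helper below is a hypothetical example that assumes an ordinal variable coded as integers between `low` and `high`: it rounds and truncates only the imputed values and returns an indicator of which entries were imputed.

```python
def postprocess(original, completed, low, high):
    """Round and truncate imputed values; return (cleaned values, imputation flags).

    original  -- values before imputation, with None marking missing entries
    completed -- the same variable after imputation
    """
    flags = [v is None for v in original]
    cleaned = [
        # Note: Python's round() sends exact halves to the nearest even integer.
        min(max(round(c), low), high) if flag else c
        for c, flag in zip(completed, flags)
    ]
    return cleaned, flags


cleaned, flags = postprocess([3, None, 1, None], [3, 5.7, 1, -0.4], low=1, high=5)
print(cleaned)  # [3, 5, 1, 1]
print(flags)    # [False, True, False, True]
```

Keeping the flags alongside the completed data lets later users distinguish observed from imputed values.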
Study Background: CanCORS
The CanCORS consortium is funded by the National Cancer Institute and the Veteran’s Administration to examine services and outcomes of care delivered to population-based cohorts of patients diagnosed with lung and colorectal cancer from 2003 to 2005 in multiple regions of the country. It consists of 7 study sites. Each site identified appropriate samples to obtain combined cohorts of approximately 5000 patients diagnosed with each cancer. CanCORS collected data from multiple sources including patient surveys and medical records. The database contains information about the care received during different stages of illness, including diagnosis, treatment, surveillance for recurrent disease, and palliation, as well as data on various clinical and patient-reported outcomes and patient preferences and behaviors.
Missing Data Problem in Hospice Study
Huskamp et al1 examined patterns of cancer hospice care, which includes a broad array of palliative and support services for individuals with terminal illness. The study identified patient characteristics and preferences that are associated with patient reports, in the baseline survey and medical records, that they had discussed hospice with a care provider. The outcome variable is patients’ hospice discussion, and predictors include patients’ clinical and sociodemographic characteristics. In particular, patients’ comorbidity scale variable, which summarizes the severity of their coexisting ailments, was included as a predictor.
We use this study to construct a simplified illustrative example concerning the association between patients’ cardiovascular disease variables and hospice discussion. These predictors, including myocardial infarction, heart failure, stroke, and diabetes, were obtained from the baseline survey but were not used in the original analysis from Huskamp et al.1 The study subsample (n=2474) consists of all patients with advanced lung cancer (stage IIIB or IV). Table 2 describes the variables from the analytic subsample; some have a substantial amount of missing data, and the missing items exhibit no systematic pattern.
Logistic Regression Analysis With Missing Data
The substantive analysis is a logistic regression for hospice discussion, and predictors include all other variables in the subsample. We carry out a multiple imputation analysis, using the SRMI strategy implemented in IVEware. In this dataset, all the variables are categorical and some are ordinal (eg, income, education, and age). In IVEware, we classify all of the variables involved in the imputation as categorical, and thus binary or general logit models are used to fit each conditional regression model for imputation. We chose to present the results from imputed data after running the program for 5 iterations to achieve convergence. We also applied the CC analysis and the missing data indicator method as the ad hoc approaches for comparative purposes.
Table 3 shows the results from each method. The regression estimates from CC and SRMI are somewhat different, and the latter produces smaller standard errors than the former for all regressors, illustrating the superior efficiency of the multiple imputation analysis. At the 5% level, predictors associated with Hispanic ethnicity, divorced/separated marital status, and age 81+ group are nonsignificant under CC but significant under SRMI, whereas the predictor associated with a history of myocardial infarction (significant under CC) becomes nonsignificant under SRMI. In this case, CC discards close to 30% of the subjects. When the assumption of MCAR is violated, as in our example, CC removes cases in a nonrandom fashion and could distort the joint distribution among the variables. As a result, it could both bias point estimates and inflate standard errors and thus misidentify significant predictors. The results from the missing data indicator method are overall similar to those from SRMI, although the former also discards around 4% of the subjects, those with a missing outcome variable (ie, hospice discussion).
From the substantive point of view, the multiple imputation analysis results do not appear to suggest a significant association between cardiovascular disease status of patients with late-stage lung cancer and their tendency to talk about hospice. This is consistent with the report by Huskamp et al,1 in which the original analysis did not identify patients’ comorbidity as a significant predictor of hospice discussion.
To assess the fit of the SRMI models used, we performed posterior predictive checking28 to examine the deviation of the analysis results of interest (ie, logistic regression coefficients and their standard errors), computed from the completed data with imputations, from the same quantities calculated from simulated copies of the completed data under the model. Large deviations would indicate model inadequacy for the target analysis. Our model assessment shows that the deviation is small (results not shown), suggesting that the SRMI models are adequate for the logistic regression analysis.
In addition, we carried out some sensitivity analysis using alternative modeling strategies. When using the SRMI, another modeling option is to treat income, education, and age as continuous to capture the underlying ordering of these variables. Their corresponding conditional regression models are thus linear normal models. After rounding the continuous imputations to the nearest allowed integer values, the logistic regression analysis results (not shown) are similar to those from the option treating all variables as categorical. We also applied the joint modeling strategy using a general location model. Specifically, we treated race, marital status, and insurance as nominal variables and assumed that they follow a log-linear model with conditional independence. We treated other variables (binary or ordinal) as continuous with multivariate normal distributions conditional on the categorical variables and rounded the imputations before the completed-data analysis. This approximation to the joint distribution16 was implemented using the library “mix” in R. Estimates for the logistic model (not shown) are also similar to those obtained using the SRMI strategy. The sensitivity analysis results increase confidence in our missing data inferences.
In this review, we focus on missing data problems and multiple imputation for cross-sectional regression analysis assuming MAR. Methods have been developed for more complicated designs (eg, longitudinal or spatial studies) or missingness mechanisms (ie, NMAR). The relevant discussion is beyond the scope of this review, and some of the topics can be found in References 31 and 32.31,32 In addition, many other analytic problems can be viewed and solved from the perspective of incomplete data and multiple imputation. Examples include causal inferences on potential outcomes, measurement error problems, and confidential use of a public database.33,34
In our opinion, multiple imputation is a principled and practical approach to missing data problems. This approach involves an initial investment in multiply imputing the missing values. After multiple imputation, complete-data software can then be used to repeatedly analyze the completed datasets, extract the point estimates and their standard errors, and combine them using simple rules. Though this method requires additional storage and extra steps of repeated analysis and combining estimates, in the grand scheme of health services and outcomes investigations, it is a minor step, especially owing to the availability of software for creating multiple imputations and performing analysis.
The multiple imputation approach can be used by a single researcher analyzing a particular incomplete dataset for a unique goal. It also fits well in a setting involving a large dataset with multiple researchers using different portions of the dataset for various aims.35 In such a scenario, imputation by the data producer allows the incorporation of specialized knowledge about the reasons for missing data in the imputation procedure, including confidential information that cannot be released to the public or other variables in the imputation process that may not be used in substantive analysis by a particular researcher. Moreover, the nonresponse problem is solved in the same way for all users so that analyses will be consistent across users. Related examples include imputation projects for the Fatal Accident Reporting System,36 census industry and occupation codes,37 the National Health and Nutrition Examination Survey,16 the National Health Interview Survey,38 and CanCORS.39
We encourage investigators and practitioners to keep themselves updated about current developments in multiple imputation methods and software. However, it is important that missing data be considered not solely a data analysis problem but also a study design and implementation issue. That is, we should strive to prevent missing data in the first place.
The author thanks Alan M. Zaslavsky and Sharon-Lise T. Normand for helpful suggestions.
Sources of Funding
This work was supported by the grant U01-CA93344 from the National Cancer Institute.
The online-only Data Supplement is available at http://circoutcomes.ahajournals.org/cgi/content/full/3/1/98/DC1.
Ayanian JZ, Chrischilles EA, Fletcher RH, Fouad MN, Harrington DP. Understanding cancer treatment and outcomes: the Cancer Care Outcomes Research and Surveillance Consortium. J Clin Oncol. 2003; 22: 2292–2296.
Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
Rubin DB. Inference and missing data (with discussion). Biometrika. 1976; 63: 581–592.
Little RJA, Rubin DB. Statistical Analysis of Missing Data. 2nd ed. New York: Wiley; 2002.
Schafer JL. Multiple imputation: a primer. Stat Methods Med Res. 1999; 8: 3–15.
Cochran WG. Sampling Techniques. New York: Wiley; 1977.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Series B (Statistical Methodology). 1977; 39: 1–38.
Schafer JL. Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.
van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007; 16: 219–242.
Raghunathan TE, Lepkowski JM, VanHoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol. 2001; 27: 85–95.
Yu LM, Burton A, Rivero-Arias O. Evaluation of software for multiple imputation of semicontinuous data. Stat Methods Med Res. 2007; 16: 243–258.
Gelman AE, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. London: Chapman and Hall; 2004.
Molenberghs G, Kenward MG. Missing Data in Clinical Studies. West Sussex: Wiley; 2007.
Daniels MJ, Hogan JW. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Boca Raton, FL: Chapman and Hall; 2008.
Gelman AE, Meng XL. Applied Bayesian Modeling and Causal Inference from Incomplete Data Perspective. New York: Wiley; 2004.
Rubin DB. Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. Proc Survey Research Methods Section of the American Statistical Association. 1978; 1: 20–34.
Heitjan DF, Little RJA. Multiple imputation for the Fatal Accident Reporting System. J R Stat Soc Series C (Applied Statistics). 1991; 40: 13–29.
Schenker N, Treiman DJ, Weidman L. Analyses of public use decennial census data with multiply imputed industry and occupation codes. J R Stat Soc Series C (Applied Statistics). 1993; 42: 545–556.
He Y, Zaslavsky AM, Harrington DP, Catalano P, Landrum MB. Multiple imputation in a large-scale complex survey: a practical guide. Stat Methods Med Res. 2009 Aug 4 [Epub ahead of print].