Table 11 Method for imputation of missing data: Multiple imputation

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Multiple imputation

Multiple imputation is a widely used approach for handling missing data under the MAR scenario. It predicts the missing values with a regression model based on the available variables. In an iterative fashion, the missing values of each incomplete variable are predicted using a regression model that depends on the other observed variables, and the resulting predicted values are used in the main regression model. To account for the uncertainty in the imputation, multiple imputed datasets are generated, each is analyzed separately, and the results are combined according to “Rubin’s rules” [61]. Software for multiple imputation is available in the major statistical packages. As described above, for HDD a pre-selection of variables is often advisable before applying multiple imputation.
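The procedure described above can be illustrated with a minimal sketch. It is not a full chained-equations implementation: it assumes a single incomplete variable `y`, one fully observed predictor `x`, simulated data, and the mean of `y` as the analysis target. The function names `impute_once` and `rubin_pool` are hypothetical, and refitting the regression on a bootstrap sample is used here as a simple stand-in for drawing the regression parameters from their posterior distribution.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m point estimates and their variances across imputed datasets."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()             # pooled point estimate
    w = u.mean()                 # within-imputation variance
    b = q.var(ddof=1)            # between-imputation variance
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t

def impute_once(x, y, rng):
    """One stochastic regression imputation of y from x (illustrative)."""
    obs = ~np.isnan(y)
    # Refit on a bootstrap sample of the observed cases so that the
    # uncertainty of the regression parameters differs between imputations.
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    resid = y[idx] - (intercept + slope * x[idx])
    sigma = resid.std(ddof=2)
    y_imp = y.copy()
    n_mis = int((~obs).sum())
    # Prediction plus random noise, so imputed values are not too regular.
    y_imp[~obs] = intercept + slope * x[~obs] + rng.normal(0.0, sigma, n_mis)
    return y_imp

# Simulated example (assumption): y depends linearly on x.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# MAR: the probability that y is missing depends only on the observed x.
p_miss = 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(rng.random(n) < p_miss, np.nan, y)

m = 20  # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    y_imp = impute_once(x, y_obs, rng)
    estimates.append(y_imp.mean())             # analysis on one dataset
    variances.append(y_imp.var(ddof=1) / n)    # its squared standard error

pooled_mean, total_var = rubin_pool(estimates, variances)
print(f"pooled mean of y: {pooled_mean:.3f}, pooled SE: {total_var ** 0.5:.3f}")
```

The pooled variance is larger than the average within-imputation variance because the between-imputation term carries the extra uncertainty caused by the missing values. With several incomplete variables, the imputation step would cycle over the variables in turn, as described in the text.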

Future directions for HDD analysis include a more detailed look at MAR settings (as all procedures provided so far are fully justified only when the MCAR assumption is tenable), the use of auxiliary information for specifying the imputation model, and the development of analysis methods that can cope directly with missing values, such as robust PCA and random forests. The best method also depends on the analysis goal, such as cluster analysis or developing a prediction model.