Data-driven recommendation of an imputation algorithm

DIMA

Egert et al. (2020), DIMA: Data-driven recommendation of an imputation algorithm, unpublished

Short description

Missing value (MV) imputation is a crucial step in a proteomics analysis pipeline, and it strongly affects downstream analyses. To facilitate the choice of a well-performing imputation method, DIMA implements a novel concept for the data-driven recommendation of an imputation algorithm. It learns the individual pattern of MVs in a given data set and tests many different imputation strategies on that pattern to suggest the best-performing algorithm for the specific input data.

Method

DIMA consists of five main steps:

1. The pattern of missing values in the original data I is learned by logistic regression analysis.
2. A data matrix K is defined as a subset of the original data I containing no or only few MVs. It serves as a reference data set for evaluating imputation performance. K should contain at least 50 proteins.
3. To generate patterns of missing data with the same distribution as in the original data set, the logistic regression model of step 1 is applied to the reference data set K. Bernoulli trials are performed to simulate different patterns S of MVs. By default, n=5 different patterns are generated.
4. Various imputation algorithms are applied to the data subset K with the patterns S of MVs and ranked by their root mean square error (RMSE). The best-performing algorithm is the one with the lowest mean rank over the simulated patterns S.
5. The best-performing imputation algorithm of step 4 is recommended for the original data set I, and the imputation of I is performed.
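The simulation and ranking steps can be illustrated with a small, self-contained Python sketch. It is not DIMA's implementation: the logistic model is assumed to be already fitted, with a single hypothetical predictor (the mean intensity of a protein) and made-up coefficients, and three toy imputation rules stand in for the R algorithms. All function and variable names are illustrative only.

```python
import math
import random
import statistics

def missing_prob(mean_intensity, intercept, slope):
    """Logistic model for P(value is missing); the coefficients are hypothetical."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * mean_intensity)))

def simulate_pattern(K, intercept, slope, rng):
    """One Bernoulli trial per entry of K; True marks a simulated MV."""
    mask = []
    for row in K:
        p = missing_prob(statistics.fmean(row), intercept, slope)
        draws = [rng.random() < p for _ in row]
        if all(draws):  # keep at least one observed value per protein
            draws[rng.randrange(len(row))] = False
        mask.append(draws)
    return mask

def impute_row_stat(K, mask, stat):
    """Toy imputation: fill masked entries with a statistic of the observed row values."""
    out = []
    for row, m in zip(K, mask):
        fill = stat([v for v, miss in zip(row, m) if not miss])
        out.append([fill if miss else v for v, miss in zip(row, m)])
    return out

def rmse(K, imputed, mask):
    """Root mean square error over the simulated missing entries only."""
    sq = [(k - x) ** 2
          for kr, xr, mr in zip(K, imputed, mask)
          for k, x, miss in zip(kr, xr, mr) if miss]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0

def recommend(K, algorithms, intercept, slope, n_patterns=5, seed=0):
    """Rank the algorithms by RMSE on each simulated pattern; lowest mean rank wins."""
    rng = random.Random(seed)
    rank_sums = {name: 0.0 for name in algorithms}
    for _ in range(n_patterns):
        mask = simulate_pattern(K, intercept, slope, rng)
        errors = {name: rmse(K, f(K, mask), mask) for name, f in algorithms.items()}
        # Ordinal ranks; ties are broken by the order of the algorithms dict.
        for rank, name in enumerate(sorted(errors, key=errors.get), start=1):
            rank_sums[name] += rank
    mean_rank = {name: s / n_patterns for name, s in rank_sums.items()}
    return min(mean_rank, key=mean_rank.get), mean_rank

# Toy reference matrix K: 50 "proteins" x 6 samples of log-intensity-like values.
data_rng = random.Random(1)
K = [[20.0 + 0.2 * i + data_rng.uniform(-0.5, 0.5) for _ in range(6)]
     for i in range(50)]

algorithms = {
    "row_mean": lambda K, m: impute_row_stat(K, m, statistics.fmean),
    "row_min": lambda K, m: impute_row_stat(K, m, min),
    "zero": lambda K, m: impute_row_stat(K, m, lambda observed: 0.0),
}

# Hypothetical coefficients: low-intensity proteins are more likely to be missing.
best, mean_rank = recommend(K, algorithms, intercept=10.0, slope=-0.5)
print(best, mean_rank)
```

Ranking by mean rank rather than by mean RMSE keeps a single unlucky pattern from dominating the comparison, which is one plausible reading of why step 4 aggregates ranks over the simulated patterns.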

Usage

An OmicsData object is created by

O = OmicsData(file);

where file is an .xls, .txt or .mat file or a numeric input; for example, MaxQuant output tables can serve as file inputs. DIMA is executed via

O = DIMA(O,[algorithms],[bio]);

Either the default imputation algorithms are used, or the algorithms are specified by the user. DIMA currently comprises 30 imputation algorithms from 13 R packages. A fast version, which runs only the ten algorithms most frequently recommended on 167 PRIDE data sets, is available by setting algorithms='fast', but should be used with caution. The optional third input argument bio is a flag indicating whether additional biological information, if available in the input data file, should be taken into account. After applying DIMA, the suggested algorithm and the respective imputation are saved in the data object O:

algorithm = get(O,'DIMA');

data = get(O,'data');

Package installation

DIMA performs the imputation in R via Rlink, so all R packages used for imputation have to be installed beforehand:

# Install the required CRAN packages one by one
packages <- c('Rtools','R.matlab','amap','mice','norm','Amelia','Hmisc','imputeLCMD','missForest','softImpute','VIM','rrcovNA','missMDA','mi','DMwR','GMSimpute')

for (pkg in packages) {
  install.packages(pkg, dependencies=TRUE, repos='http://cran.rstudio.com/')
}

install.packages("BiocManager")

BiocManager::install("pcaMethods")

BiocManager::install("impute")

install.packages("https://cran.r-project.org/src/contrib/Archive/imputation/imputation_1.3.tar.gz", repos=NULL, type='source')
