Data-driven recommendation of an imputation algorithm
Egert et al. (2020), DIMA: Data-driven recommendation of an imputation algorithm, unpublished
Missing value (MV) imputation is an essential step in a proteomics analysis pipeline, and downstream analyses are strongly affected by it. To help users choose a well-performing imputation method, a novel concept for the data-driven recommendation of an imputation algorithm is presented. DIMA combines learning the individual pattern of MVs in a given data set with testing many different imputation strategies, and suggests the best-performing algorithm for the specific input data.
DIMA consists of five main steps:
1. The pattern of missing values in the original data I is learned by logistic regression analysis.
2. A data matrix K is defined as the subset of the original data I that contains no or only a few MVs. K serves as the reference data set for evaluating imputation performance and should contain at least 50 proteins.
3. To generate patterns of missing data with the same distribution as in the original data set, the logistic regression model of step 1 is applied to the known data set K. Bernoulli trials are performed to simulate different patterns S of MVs. By default, n=5 patterns are generated.
4. Various imputation algorithms are applied to the data subset K with the patterns S of MVs and ranked by their root mean square error (RMSE). The best-performing algorithm is the one with the lowest mean rank over the simulated patterns S.
5. The best-performing imputation algorithm of step 4 is recommended for the original data set I, and imputation of I is performed.
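The simulation and ranking logic of steps 3 and 4 can be illustrated with a small, self-contained Python sketch. It is not DIMA's MATLAB/R implementation. The function names, the three toy imputation rules, and the logistic coefficients are assumptions made for illustration, and the missingness model here uses intensity as its only predictor.

```python
import math
import random


def missing_probability(intensity, intercept, slope):
    """Logistic model for P(value is missing | intensity)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * intensity)))


def simulate_pattern(K, intercept, slope, rng):
    """One Bernoulli trial per entry of K; True marks a simulated MV."""
    return [[rng.random() < missing_probability(x, intercept, slope) for x in row]
            for row in K]


def observed(K, mask):
    """All values of K that are not masked."""
    return [x for row, mrow in zip(K, mask) for x, m in zip(row, mrow) if not m]


def impute_zero(K, mask):
    return [[0.0 if m else x for x, m in zip(row, mrow)] for row, mrow in zip(K, mask)]


def impute_global_min(K, mask):
    obs = observed(K, mask)
    fill = min(obs) if obs else 0.0
    return [[fill if m else x for x, m in zip(row, mrow)] for row, mrow in zip(K, mask)]


def impute_row_mean(K, mask):
    obs = observed(K, mask)
    fallback = sum(obs) / len(obs) if obs else 0.0  # used if a whole row is masked
    out = []
    for row, mrow in zip(K, mask):
        seen = [x for x, m in zip(row, mrow) if not m]
        fill = sum(seen) / len(seen) if seen else fallback
        out.append([fill if m else x for x, m in zip(row, mrow)])
    return out


def rmse(K, imputed, mask):
    """RMSE over the masked entries only, where the true values are known from K."""
    sq = [(k - v) ** 2
          for krow, vrow, mrow in zip(K, imputed, mask)
          for k, v, m in zip(krow, vrow, mrow) if m]
    return math.sqrt(sum(sq) / len(sq)) if sq else 0.0


def mean_ranks(K, patterns, algorithms):
    """Rank algorithms by RMSE within each pattern, then average the ranks."""
    totals = {name: 0 for name in algorithms}
    for mask in patterns:
        errors = {name: rmse(K, impute(K, mask), mask)
                  for name, impute in algorithms.items()}
        # Ordinal ranks: 1 = lowest RMSE; ties are broken by dictionary order.
        for rank, name in enumerate(sorted(errors, key=errors.get), start=1):
            totals[name] += rank
    return {name: total / len(patterns) for name, total in totals.items()}


# Toy stand-ins for the imputation algorithms (hypothetical, not DIMA's list).
ALGORITHMS = {"zero": impute_zero,
              "global_min": impute_global_min,
              "row_mean": impute_row_mean}

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy reference matrix K (proteins x samples) of log-intensities.
    K = [[rng.gauss(20 + i, 0.5) for _ in range(6)] for i in range(10)]
    # Assumed coefficients: lower intensities are more likely to be missing.
    patterns = [simulate_pattern(K, intercept=9.0, slope=-0.45, rng=rng)
                for _ in range(5)]
    ranks = mean_ranks(K, patterns, ALGORITHMS)
    print("mean ranks:", ranks)
    print("recommended:", min(ranks, key=ranks.get))
```

Averaging ranks rather than raw RMSE values keeps a single simulated pattern with unusually large errors from dominating the recommendation.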
An OmicsData object is created by
O = OmicsData(file);
where file can be an .xls, .txt or .mat file or a numeric input. For example, MaxQuant output tables can serve as file input. DIMA is executed via
O = DIMA(O,[algorithms],[bio]);
Either the default imputation algorithms are used, or the user specifies them via algorithms. DIMA currently comprises 30 imputation algorithms from 13 R packages. A fast version, which runs only the ten algorithms most frequently recommended across 167 PRIDE data sets, is available by setting algorithms='fast', but should be used with caution. The optional third input argument bio is a flag that indicates whether additional biological information, if available in the input data file, should be taken into account. After DIMA has run, the suggested algorithm and the corresponding imputed data are stored in the OmicsData object:
algorithm = get(O,'DIMA');
data = get(O,'data');
DIMA performs imputation via Rlink, so all R packages used for imputation have to be installed beforehand:
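# CRAN packages used for imputation. Rtools is not listed here: it is a separate
# Windows build toolchain, not a package that install.packages() can fetch.
packages <- c('R.matlab','amap','mice','norm','Amelia','Hmisc','imputeLCMD','missForest','softImpute','VIM','rrcovNA','missMDA','mi','DMwR','GMSimpute')
# install.packages() accepts a character vector, so no loop is needed.
install.packages(packages, dependencies=TRUE, repos='http://cran.rstudio.com/')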
install.packages("BiocManager")
BiocManager::install("pcaMethods")
BiocManager::install("impute")
install.packages("https://cran.r-project.org/src/contrib/Archive/imputation/imputation_1.3.tar.gz", repos=NULL, type='source')