Commit 16e9845: Merge overleaf-2021-10-31-1014 into main
spisakt authored Oct 31, 2021
2 parents 049f379 + dec8913
Showing 1 changed file with 9 additions and 9 deletions: manuscript.tex
@@ -59,12 +59,12 @@ \section{Introduction}

Spurious, out-of-interest associations between the predictor variables (features) and the prediction target can be detrimental to a model's biomedical validity and generalizability. This phenomenon is often called confounding bias\citep{prosperi2020causal}. Confounding bias can be driven by, among others, measurement artifacts (e.g.\ motion artifacts in magnetic resonance imaging-based predictive models of Alzheimer's disease\citep{rao2017predictive}, attention deficit hyperactivity disorder\citep{eloyan2012automated, couvy2016head} or Autism Spectrum Disorder (ASD)\citep{gotts2013perils, spisak2014voxel, spisak2019optimal}), demographic and psychometric variables (e.g.\ models trained to predict intelligence\citep{cole2012global, he2020deep} might achieve statistically significant predictive performance by picking up solely on age-related variance\citep{dubois2018distributed, lohmann2021predicting}), sampling bias and stochastic group differences (e.g.\ racially biased machine learning models\citep{obermeyer2019dissecting, lwowski2021risk}), as well as batch effects or, in multi-center studies, center effects.

While various data cleaning methods may help mitigate confounding bias\citep{rao2017predictive, dukart2011age, spisak2014voxel, abdulkadir2014reduction, johnson2007adjusting}, it is often unclear which variables should be considered confounders, and such approaches risk eliminating signal of interest\citep{wachinger2021detect}.

Powerful and robust statistical tests for quantifying confounding bias in predictive models could substantially foster both the identification of confounders to correct for and the assessment of the effectiveness of various confound-mitigation approaches. It is tempting to think of confounding bias as the \emph{conditional dependence} of the model predictions on the confounder, given the target variable. However, properly evaluating conditional independence among these variables is challenging. Namely, even in the presence of slight non-normality and/or non-linearity of the involved conditional distributions, the 'conditional' analogs of the most popular bivariate non-parametric tests (like the partial Spearman correlation, see Fig. \ref{fig:sim-h0-demo}) are not valid measures of conditional independence. Although warnings about this issue were given early on\citep{korn1984ranges} and it has received a fair amount of attention recently\citep{bergsma2010nonparametric, candes2016panning, peters2016causal, shah2020hardness, berrett2020conditional}, the magnitude of the problem may not be fully appreciated in the case of predictive model diagnostics, where non-normality and non-linearity of the model output are frequently seen as a consequence of, e.g., feature-set characteristics and model regularization\citep{garcia2009study, kristensen2017whole} (see Supplementary Material \ref{sup:nomlinviol} for a simple example).

Recently, two different approaches were proposed for quantifying confounding bias \citep{chaibub2019permutation, ferrari2020measuring}. However, these methods either fail to control type I error (as known in the case of balanced permutations\cite{southworth2009properties, hemerik2018exact}, used in ref.\cite{chaibub2019permutation}), or do not provide p-values at all\cite{ferrari2020measuring}.
Moreover, without some modifications, they are only applicable for categorical variables and involve re-fitting the model, which may not be feasible for models with high computational cost (e.g. when trained with nested cross-validation).

This work aims to construct a statistical test for confounding bias that (i) guarantees valid type I error control for arbitrary models, even if non-normal and/or non-linear dependencies are involved, (ii) does not require re-fitting the model, and (iii) is applicable to classification as well as regression problems, with both numerical and categorical confounders.

@@ -97,7 +97,7 @@ \subsection{The partial and full confounder tests}
The concept of conditional independence provides a straightforward framework for assessing confounding bias in predictive models. However, handling the non-normal and/or non-linear conditional dependencies often seen in predictive models\citep{garcia2009study, kristensen2017whole} poses a great challenge.
In fact, as recently shown by Shah and colleagues in their 'no free lunch' theorem\cite{shah2020hardness}, it is effectively impossible to establish a \emph{fully non-parametric} conditional independence test with valid type I error control and non-trivial power. Indeed, perhaps somewhat surprisingly, but not totally unexpectedly\cite{korn1984ranges}, partial-correlation-like analogs of widely used bivariate non-parametric tests, such as the partial Spearman correlation, exhibit inflated type I errors even under slight violations of normality and/or linearity (as clearly demonstrated with simulated data in Fig. \ref{fig:sim-h0-demo}). Such tests are therefore poor choices for testing confounding bias in machine learning.
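The severity of this failure mode can be reproduced with a short simulation. The sketch below uses illustrative settings (not the exact settings behind Fig. \ref{fig:sim-h0-demo}): the null hypothesis $\yhat \independent \c \mid \y$ holds by construction, because the predictions and the confounder share only the target variable, yet the partial Spearman correlation rejects it far more often than the nominal level.

```python
import numpy as np
from scipy import stats

def partial_spearman_p(x, y, z):
    # p-value for the partial Spearman correlation of x and y given z,
    # via the standard partial-correlation formula on rank correlations
    # and a t-approximation with n - 3 degrees of freedom
    n = len(x)
    rxy, _ = stats.spearmanr(x, y)
    rxz, _ = stats.spearmanr(x, z)
    ryz, _ = stats.spearmanr(y, z)
    r = (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))
    t = r * np.sqrt((n - 3) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), df=n - 3)

rng = np.random.default_rng(1)
n_sim, n, alpha = 100, 200, 0.05
rejections = 0
for _ in range(n_sim):
    y = rng.normal(size=n)                   # target variable
    c = y**2 + 0.1 * rng.normal(size=n)      # confounder: non-linear in y
    yhat = y**2 + 0.1 * rng.normal(size=n)   # predictions: non-linear in y
    # H0 holds by construction: yhat and c are independent given y
    if partial_spearman_p(yhat, c, y) < alpha:
        rejections += 1
print(f"empirical type I error rate: {rejections / n_sim:.2f}")  # far above 0.05
```

The non-monotone dependence on $\y$ leaves a shared component in both rank-residuals, which the linear partialling of ranks cannot remove.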

As, in terms of its conditional distribution given the other variables, the model output is clearly the most intractable of the three involved variables (target, prediction, confounder)\citep{garcia2009study, kristensen2017whole}, a method that is distribution-free only for this variable may already provide sufficient robustness for predictive model diagnostics. Exactly this is achieved by the proposed approach, which extends the novel framework of conditional permutation testing (CPT)\cite{berrett2020conditional} with conditional distribution estimation via generalized additive models (GAM)\citep{hastie1987generalized} or multinomial logistic models (mnlogit)\cite{bennett1966multiple, jones1975proability}. The proposed approach offers two novel tests for probing confounding bias: the \emph{full confounder test} probes whether the model's predictive performance can be attributed exclusively to the confounder, and the \emph{partial confounder test} investigates whether the model utilizes any confounder variance in its predictions, when controlled for the target variable.
These tests place no assumptions on the conditional distributions of the model predictions, ensuring valid model diagnostics even in cases of non-normally and non-linearly dependent predictions.

The inner workings of the \emph{partial confounder test} are summarized in Fig. \ref{fig:overview}. In short, the test models the conditional distribution of the confounder given the target variable with a GAM (or with an \emph{mnlogit} regression in the case of a categorical confounder) and then uses a so-called parallel-pairwise Markov chain Monte Carlo sampler\cite{berrett2020conditional} to draw permutations of the confounder, such that the permuted variables still comply with the estimated conditional distribution. The test statistic (the coefficient of determination, $R^2$) is then computed between the model predictions and the original as well as the permuted variables. The p-value is obtained as the proportion of permuted test statistics that are at least as extreme as the original.
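The logic of the procedure can be illustrated with a deliberately simplified sketch (this is \emph{not} the \emph{mlconfound} implementation): the GAM is replaced by a linear-Gaussian fit of the confounder on the target, and the pairwise MCMC permutation sampler by independent conditional draws, i.e.\ a conditional-randomization-style variant\citep{candes2016panning}. The function name and all settings are illustrative.

```python
import numpy as np

def r2(a, b):
    # test statistic: squared Pearson correlation (coefficient of determination)
    return np.corrcoef(a, b)[0, 1] ** 2

def partial_confound_test_sketch(y, yhat, c, n_perm=1000, seed=0):
    # H0: yhat is independent of c, given y.
    # The conditional distribution c | y is modelled with a linear-Gaussian
    # fit (a stand-in for the GAM of the proposed method); surrogate
    # confounders are drawn from this fit instead of being permuted.
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(y), y])
    beta, *_ = np.linalg.lstsq(X, c, rcond=None)
    sigma = (c - X @ beta).std(ddof=2)
    t_obs = r2(yhat, c)
    t_null = np.empty(n_perm)
    for i in range(n_perm):
        c_star = X @ beta + rng.normal(0.0, sigma, size=len(c))
        t_null[i] = r2(yhat, c_star)
    # p-value: proportion of surrogate statistics at least as extreme
    p = (1 + np.sum(t_null >= t_obs)) / (1 + n_perm)
    return t_obs, p
```

For a model that is blind to the confounder, the returned p-value is expected to be large; for a model driven by the confounder, small.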
@@ -233,7 +233,7 @@ \section{Discussion}
The proposed approach gives rise to two different statistical tests for testing confounding bias: the \emph{full confounder test} probes whether the model's predictive performance can be attributed exclusively to the confounder and the \emph{partial confounder test} investigates whether the model utilizes any confounder-variance in the predictions, when controlled for the target variable.
The tests can be applied to arbitrary classification or regression models without having to re-fit the model, that is, with negligible extra computational cost.

As expected from theory, both tests displayed valid type I error control and a high, practically relevant statistical power in the simulations, even when both the predictions and the confounder are non-normally and/or non-linearly dependent on the target variable (except under extreme non-normality). This result confirms that the tests can be deployed in a wide variety of predictive modelling scenarios. While different biomedical applications may consider different amounts of bias to be relevant, the presented results can serve as a basis for power calculations to identify the sample size necessary for proper model diagnostics.

A characteristic example of the potential application areas is the novel field of ``predictive neuroscience'', where applying predictive modelling and machine learning to functional neuroimaging data holds great potential both for revolutionizing our understanding of the physical basis of the mind and for delivering clinically useful tools for diagnostics or therapeutic decision making\citep{woo2017building, wager2013fmri, spisak2020pain}. However, the presence of confounders that are typical for biomedical research (e.g.\ sample demographics, center effects) or specific to the data acquisition and processing approach (e.g.\ imaging artifacts) presents a great challenge to these efforts.
The usefulness of the proposed tests is demonstrated in two such examples, using the HCP\citep{van2013wu} and the ABIDE\citep{di2014autism} datasets.
@@ -255,8 +255,8 @@ \section{Discussion}

In sum, the application of the \emph{partial} confounder test to the real data examples suggests that confounding bias should be much more carefully investigated and reported in studies utilizing predictive modelling and machine learning, as (i) variables as trivial as the date of acquisition can cause significant confounding bias, (ii) in certain situations, sufficient mitigation of confounder bias requires more effective solutions than feature regression or COMBAT, and (iii) in some cases, confounder mitigation can, paradoxically, introduce more bias. The partial confounder test can be considered a useful, objective benchmark to guide the search for a suitable confounder-mitigation approach for any given dataset.

Regarding the \emph{full} confounder test: while the real data examples in this paper were not typical cases of its potential applications, it still provided important insights into model biases. Namely, in the case of the aforementioned paradoxical motion-regression model in the ABIDE dataset, the full confounder test did not provide any evidence that the model captures any extra variance over that explained by the confounder ($p=0.1$). In other words, the test did not reject the null hypothesis of full model bias, suggesting that this model may, in fact, be almost exclusively driven by motion artifacts.
This example highlights that the full confounder test may become useful in the exploratory phases of model development, where models might still be severely biased by various confounders and the question is whether the model captures any biomedically relevant signal at all.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -269,7 +269,7 @@ \section{Conclusion}
The tests have, moreover, a minimal computational overhead, as re-fitting the model is not required.

As demonstrated on functional brain connectivity-based predictive models of fluid intelligence and autism spectrum disorder, the tests can guide the optimization of the confound mitigation strategy and allow quantitative statistical assessment of the robustness, generalizability and neurobiological validity of predictive models in biomedical research.
Given their simplicity, robustness, wide applicability, high statistical power and computationally efficient implementation (available in the Python package \emph{mlconfound}\footnote{\href{https://mlconfound.readthedocs.io}{https://mlconfound.readthedocs.io}}), the partial and full confounder tests emerge as novel tools in the methodological arsenal of predictive modelling and may substantially accelerate the development of clinically useful machine learning biomarkers.

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -283,7 +283,7 @@ \subsection{Notation and Background}

Depending on the research question, the direct influence of $\c$ on the model predictions $\yhat$ is either to be kept at a negligible level, or it is to be shown that, at the least, the model is not completely driven by $\c$.

Obviously, a strong association between $\yhat$ and $\c$ may indicate that the model is biased, that is, that its predictions are driven by the confounder rather than by information about the target variable.
Assessing the simple bivariate (unconditional) dependence ($H0: \yhat \independent \c$) between $\yhat$ and $\c$ (or between any pair of the variables $\y$, $\yhat$, $\c$) is, however, insufficient for a proper characterization of confounder bias in predictive modelling.
For instance, even if $\yhat \independent \c$ is false, $\yhat$ might be only marginally dependent on $\c$, due to the dependence of both on $\y$. In other words, if the target variable $\y$ displays a true association with the confounder variable $\c$, a model that is completely blind to $\c$ (i.e.\ not confounded at all) might still produce outputs $\yhat$ that are significantly associated with $\c$.
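This can be made concrete with a minimal simulation (all settings are illustrative): the model output below is constructed to be completely blind to $\c$, yet it correlates strongly with $\c$ marginally; conditioning on $\y$ (here via simple linear residualization, which is adequate in this linear-Gaussian toy example, though not in general) makes the association vanish.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
y = rng.normal(size=n)                 # target variable
c = y + rng.normal(size=n)             # confounder, truly associated with y
yhat = y + 0.5 * rng.normal(size=n)    # model output, completely blind to c

# marginal association: clearly non-zero, driven purely by the shared y
r_marginal = np.corrcoef(yhat, c)[0, 1]

# conditioning on y (residualizing both variables on y) removes it
X = np.column_stack([np.ones(n), y])
res_yhat = yhat - X @ np.linalg.lstsq(X, yhat, rcond=None)[0]
res_c = c - X @ np.linalg.lstsq(X, c, rcond=None)[0]
r_conditional = np.corrcoef(res_yhat, res_c)[0, 1]

print(round(r_marginal, 2), round(r_conditional, 2))  # strong vs. near zero
```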

@@ -420,7 +420,7 @@ \subsubsection*{Conditional independence for testing confounding bias}
\caption{\label{tab:conditional-independence-cases} Possibilities when testing conditional independence in potentially biased predictive models. \\The table lists the three possible null hypotheses (H0) and, for each, the variables for which assumptions about the joint/conditional distributions are or are not required. ($\y$: prediction target, $\yhat$: predictions, $\c$: confounder variable) }
\end{table}

Option 3, i.e.\ partial confounder testing, is typically of interest when testing the confounding bias of predictive models. Option 1, i.e.\ full confounder testing, may also be useful in the diagnostics of predictive models, especially in the exploratory phase of model construction. Option 2 seems less appealing for model diagnostics and, importantly, in this case the proposed variant of the CPT framework does not allow constructing a test that is non-parametric in $\yhat$.

In the following section, CPT is adapted for \emph{partial} confounder testing (option 3) and extended with generalized additive model\cite{hastie1987generalized} (GAM) and multinomial logistic regression\citep{bennett1966multiple, jones1975proability} based conditional distribution estimation, enabling it to handle categorical data and non-linear dependencies between the confounder and the target variable (for an overview of the method, see Fig. \ref{fig:overview}). The formulation of the \emph{full} confounder test (option 1) is analogous and is given in Supplementary Material \ref{sup:full-test}.
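For a categorical confounder, the conditional-distribution estimation step can be sketched as follows. This is a simplified, hypothetical illustration using scikit-learn's multinomial logistic regression: it draws independent resamples from the fitted $P(\c \mid \y)$, whereas the proposed method draws \emph{permutations} of $\c$ via the pairwise MCMC sampler of CPT.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_categorical_surrogates(y, c, n_draws=20, seed=0):
    # Estimate P(c = k | y) with a multinomial logistic model, then draw
    # surrogate confounders from the fitted conditional distribution.
    # NOTE: simplified illustration only; the proposed method permutes c
    # via the pairwise MCMC sampler of CPT, not independent resamples.
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(y.reshape(-1, 1), c)
    proba = model.predict_proba(y.reshape(-1, 1))   # rows: P(c = k | y_i)
    draws = np.array([[rng.choice(model.classes_, p=p_i) for p_i in proba]
                      for _ in range(n_draws)])
    return draws
```

Each row of the returned array can then take the place of $\c$ in the test statistic to build the null distribution.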

