add figure and references to the figures and tables
moben1 committed May 8, 2024
1 parent 6b533c3 commit 0ca5b20
Showing 2 changed files with 62 additions and 41 deletions.
14 changes: 14 additions & 0 deletions paper/paper.bib
@article{dorogush2018catboost,
year={2018}
}

@article{breiman2001randomforest,
author = {Breiman, Leo},
title = {Random Forests},
journal = {Machine Learning},
volume = {45},
number = {1},
pages = {5--32},
year = {2001},
month = {Oct},
doi = {10.1023/A:1010933404324},
issn = {1573-0565},
url = {https://doi.org/10.1023/A:1010933404324},
}

@article{hall2000correlation,
title={Correlation-based feature selection of discrete and numeric class machine learning},
author={Hall, Mark A},
89 changes: 48 additions & 41 deletions paper/paper.md

## Preprocessing

The preprocessing phase in **qsarKit** (as illustrated in \autoref{fig:pipeline}) begins with feature
selection [@Comesana2022], aimed at enhancing model performance and interpretability. Features demonstrating low
variance across the dataset are eliminated, as they contribute minimal predictive power. This selection is based on the
variance threshold technique, ensuring that only features contributing significantly to model diversity are retained.
Additionally, to address the challenge of multicollinearity [@remeseiro2019review], a correlation-based feature
selection is applied using *Kendall*'s rank correlation coefficient [@prematunga2012correlational]. Features whose
pairwise correlation exceeds a set threshold are systematically removed [@hall2000correlation]. Further refinement is
achieved through Recursive Feature Elimination (*RFE*) [@guyon2002gene], a process that systematically reduces the
feature set to those most significant for model prediction, thereby improving both model interpretability and performance.
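
For a concrete picture of these three steps, the sketch below reproduces them with scikit-learn and pandas; the
function name `select_features` and all threshold values are illustrative placeholders, not the actual qsarKit API.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import Ridge

def select_features(X: pd.DataFrame, y: pd.Series,
                    var_threshold: float = 0.01,
                    corr_threshold: float = 0.9,
                    n_features: int = 20) -> pd.DataFrame:
    # 1. Drop low-variance descriptors.
    vt = VarianceThreshold(threshold=var_threshold)
    X = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])

    # 2. Drop one feature from each highly correlated pair (Kendall's tau).
    corr = X.corr(method="kendall").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > corr_threshold).any()])

    # 3. Recursive Feature Elimination with a simple linear estimator.
    rfe = RFE(Ridge(), n_features_to_select=min(n_features, X.shape[1])).fit(X, y)
    return X.loc[:, rfe.support_]
```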

## Data Augmentation Using Generative Adversarial Networks

To counter the prevalent issue of limited and imbalanced QSAR datasets, **qsarKit** employs a GAN (\autoref{fig:pipeline})
for data augmentation. This approach addresses the shortcomings of traditional datasets by generating new, plausible
molecular structures, thereby expanding the diversity and size of the training set [@decao2018molgan]. The GAN module
comprises a *Featurizer*, which prepares molecular structures in SMILES format for processing, followed by the GAN
itself, which trains on the available data to produce new molecular structures. The generated structures are then
converted back into quantitative features through the *Descriptor Extraction* process, making them suitable for
subsequent QSAR modeling.
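
A heavily simplified sketch of this flow is shown below: the trained generator is stubbed out as a hypothetical
`generate_smiles` callable, and RDKit is assumed for the descriptor-extraction step; neither reflects the actual
qsarKit implementation.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def extract_descriptors(smiles):
    """Convert a generated SMILES string back into quantitative features."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # discard chemically invalid structures produced by the generator
        return None
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumHDonors": Descriptors.NumHDonors(mol),
    }

def augment(generate_smiles, n_samples):
    """Sample molecules from a trained generator (hypothetical callable) and featurize them."""
    rows = []
    for smi in generate_smiles(n_samples):  # placeholder for the trained GAN's sampling call
        feats = extract_descriptors(smi)
        if feats is not None:
            rows.append(feats)
    return rows
```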

## Model Training and Optimization

<!-- todo add the sklearn model references -->
**qsarKit** supports six core models (\autoref{models}), including both regression and ensemble methods, tailored for
QSAR analysis. This selection grants users the flexibility to choose the most appropriate model for their data and
objectives. Model training in **qsarKit** is rigorously evaluated using cross-validation techniques, ensuring the
models' generalization capabilities to unseen data. Special emphasis is placed on maintaining the original distribution
of chemical properties and response variables through strategic binning and stratification, thereby preserving the
integrity and representativeness of the dataset.
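
The binning-plus-stratification idea can be sketched as follows with scikit-learn; the bin count, fold count, and the
random forest estimator are illustrative choices, not qsarKit defaults.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

def stratified_cv_scores(X, y, n_bins: int = 5, n_splits: int = 5):
    # Bin the continuous response so every fold preserves its distribution.
    y = np.asarray(y)
    bin_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    y_binned = np.digitize(y, bin_edges)

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    folds = list(cv.split(X, y_binned))  # stratify on the bins, train on the raw target
    model = RandomForestRegressor(random_state=0)
    return cross_val_score(model, X, y, cv=folds, scoring="r2")
```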

: Notations and variables used in **qsarKit**. []{label="notations"}

| Variable | Notation |
|-------------------------|:----------------------------:|
| Regularisation function | $\Omega$ |
| Decision trees | $f_k$ |

: Available models in **qsarKit** and their respective loss functions. []{label="models"}

| Models | Equation |
|------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Ridge Regression [@hoerl2000ridge] | ${\displaystyle J(\theta) = \sum_{i=1}^n (y_i-\hat {y_i})^2 + \lambda \sum_{j=1}^p \theta_j^2}$ |
| Lasso Regression [@tibshirani1996lasso] | ${\displaystyle J(\theta) = \sum_{i=1}^n (y_i-\hat {y_i})^2+\lambda \sum_{j=1}^{p}\vert\theta_{j}\vert}$ |
| Elasticnet [@tay2021elasticnet] | ${\displaystyle J(\theta) =\sum_{i=1}^n (y_i-\hat {y_i})^2 + \lambda_{1}\sum_{j=1}^p\vert\theta_j \vert+\lambda_{2}\sum_{j=1}^p\theta_j^{2}}$ |
| Random Forest [@breiman2001randomforest] | ${\displaystyle {\hat {y}}={\frac {1}{m}}\sum _{j=1}^{m}\sum _{i=1}^{n}W_{j}(x_{i},x')\,y_{i}=\sum _{i=1}^{n}\left({\frac {1}{m}}\sum _{j=1}^{m}W_{j}(x_{i},x')\right)\,y_{i}}$ |
| XGBoost [@chen2016xgboost] | ${\displaystyle J(\theta) = \sum_{i=1}^n L(y_i,\hat {y_i}) + \sum_{k=1}^K \Omega(f_k)}$ |
| CatBoost [@dorogush2018catboost] | ${\displaystyle J(\theta) = \frac{1}{n}\sum_{i=1}^n (y_i-\hat {y_i})^2}$ |
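
As a rough guide, the six model families in the table above correspond to the following widely used Python
implementations (scikit-learn, XGBoost, and CatBoost); the hyperparameter values shown are illustrative only and are
not qsarKit's defaults.

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# One representative estimator per model family listed in the table above.
models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "random_forest": RandomForestRegressor(n_estimators=500),
    "xgboost": XGBRegressor(n_estimators=500, learning_rate=0.05),
    "catboost": CatBoostRegressor(iterations=500, verbose=0),
}
```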

To optimize model performance, **qsarKit** employs *Optuna* for systematic hyperparameter tuning, leveraging Bayesian
optimization techniques to explore the parameter space efficiently [@akiba2019optuna]. This process converges toward an
optimal set of hyperparameters for each QSAR model.
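
A minimal Optuna loop in this spirit might look like the sketch below; the search space, the choice of XGBoost, and the
`X_train`/`y_train` variables in the usage comment are assumptions for illustration, not the package's actual
configuration.

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def tune_xgboost(X, y, n_trials: int = 50) -> dict:
    """Search XGBoost hyperparameters with Optuna's default TPE (Bayesian) sampler."""
    def objective(trial: optuna.Trial) -> float:
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        }
        # Score each candidate with cross-validated R^2 on the training data.
        return cross_val_score(XGBRegressor(**params), X, y, cv=5, scoring="r2").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params

# e.g. best = tune_xgboost(X_train, y_train), with X_train/y_train from the preprocessing step
```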

## Integrated Pipeline

At its core, **qsarKit** is designed as a modular and comprehensive pipeline (\autoref{fig:pipeline}), encapsulating the
entire QSAR modeling process from initial data preprocessing to final prediction and evaluation. The pipeline allows for
the seamless integration of data augmentation, model training, and evaluation, supporting a range of evaluation metrics
(\autoref{metrics}), including $R^2$, $Q^2$, and $RMSE$, to assess model performance accurately. The modularity of the
package permits users to engage with specific components individually or utilize the entire pipeline for end-to-end
processing, accommodating diverse research needs and objectives in the QSAR domain.

: Evaluation metrics used in **qsarKit**. []{label="metrics"}

| Evaluation metrics | Equation |
|--------------------------------|:-----------------------------------------------------------------------------------------------------------------:|
| Coefficient of Determination R | $R^2 = 1 - \frac{\sum_{i=1}^n (y_i-\hat {y_i})^2}{\sum_{i=1}^n (y_i-\overline y_i)^2}$, where $y_i \in D_{train}$ |
| Coefficient of Determination Q | $Q^2 = 1 - \frac{\sum_{i=1}^n (y_i-\hat {y_i})^2}{\sum_{i=1}^n (y_i-\overline y_i)^2}$, where $y_i \in D_{test}$ |
| Root Mean Square Error | $RMSE = \sqrt{\frac{\sum_{i=1}^n (y_i-\hat {y_i})^2}{N}}$ |
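
These metrics reduce to a few lines of NumPy; in the sketch below, $Q^2$ is simply $R^2$ evaluated on held-out test
predictions, mirroring the $D_{train}$/$D_{test}$ distinction in the table above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# R^2 is computed on training predictions, Q^2 on test predictions:
# r2 = r_squared(y_train, model.predict(X_train))
# q2 = r_squared(y_test, model.predict(X_test))
```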

![qsarKit pipeline. \label{fig:pipeline}](qsarKit_h.png)

# Application and Results: QSAR Modeling in the Breastfeeding Context

The **qsarKit** package has been specifically designed and applied to address a significant healthcare need: delivering
a framework for the prediction of chemical transfer ratios from maternal plasma to breast milk, a crucial consideration
for the health of breastfeeding mothers and infants. This application underscores the importance of understanding and
predicting the Milk-to-Plasma concentration ratio [@anderson2016], denoted as
\begin{equation}\label{eq:mp_ratio}
M/P_{ratio} = \frac{AUC_{milk}}{AUC_{plasma}}
\end{equation}
where $AUC_{milk}$ and $AUC_{plasma}$ are the areas under the curve of the concentration of a molecule in maternal
milk and in plasma, respectively. This ratio represents the extent to which various pharmaceutical drugs and environmental
chemicals can transfer into breast milk.
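
As a worked illustration of \autoref{eq:mp_ratio}, the two AUCs can be approximated from sampled concentration-time
curves with the trapezoidal rule; the numbers below are invented purely for demonstration and are not study data.

```python
import numpy as np

def auc_trapezoid(t, c):
    """Area under a concentration-time curve via the trapezoidal rule."""
    t, c = np.asarray(t), np.asarray(c)
    return float(np.sum(0.5 * (c[1:] + c[:-1]) * np.diff(t)))

t = [0.0, 1.0, 2.0, 4.0, 8.0]          # sampling times (h), illustrative
c_milk = [0.0, 0.8, 1.1, 0.7, 0.2]     # invented milk concentrations
c_plasma = [0.0, 1.5, 1.9, 1.2, 0.4]   # invented plasma concentrations

mp_ratio = auc_trapezoid(t, c_milk) / auc_trapezoid(t, c_plasma)
print(f"M/P ratio = {mp_ratio:.2f}")
```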

## Data Nature and Contextual Background
The foundational data employed in this study originate from a diverse set of molecules, including pharmaceutical drugs
and environmental chemicals, contextualized within the breastfeeding scenario. The primary focus is on the quantitative
prediction of the $M/P_{ratio}$, which is necessary for assessing the safety and exposure risks of breastfeeding infants
to these substances [@verstegen2022]. By applying **qsarKit** to this domain, we aim to provide an effective, freely
available framework to support breastfeeding research.

## Dataset Composition and Distribution

