- This package facilitates the development of Quantitative Structure-Activity Relationship (QSAR) models using the SEND database. It streamlines data acquisition, pre-processing, organ-wise toxicity score calculation, descriptor calculation, and model evaluation, enabling researchers to efficiently explore molecular descriptors and create robust predictive models.
- Detailed descriptions of each function are available in the “Articles” section of the GitHub-hosted website.
- Automated Data Processing: Simplifies data acquisition and pre-processing steps.
- Comprehensive Analysis: Provides z-score calculations for various parameters such as body weight, liver-to-body weight ratio, and laboratory tests.
- Machine Learning Integration: Supports classification modeling, hyperparameter tuning, and performance evaluation.
- Visualization Tools: Includes, among others, histograms, bar plots, and AUC curves for better data interpretation.
- Input Database Path: Provide the path to the database or `.xpt` files containing nonclinical study data in SEND format.
- Data Pre-processing: Use functions `f1` to `f8` to clean, harmonize, and prepare data for machine learning (ML).
- Model Building: Employ ML functions (`f9` to `f18`) for ML model training and evaluation.
- Visualization: Generate plots and performance metrics for better interpretation (`f12` to `f15`).
- Automated Pipelines: Use functions `f15` to `f18` to perform the above workflows in a single step by providing the database path and a `.csv` file containing the label (TOXIC/NON-TOXIC) of each `STUDYID`.
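The label file mentioned in the last step can be checked before it is handed to a pipeline function. The sketch below is only an illustration in base R: the column names `STUDYID` and `LABEL`, the study identifiers, and the helper `read_labels` are assumptions, not names taken from the package.

```r
# Hypothetical label file: column names and study IDs are assumptions.
labels <- data.frame(
  STUDYID = c("S001", "S002", "S003"),
  LABEL   = c("TOXIC", "NON-TOXIC", "TOXIC"),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
write.csv(labels, csv_path, row.names = FALSE)

# Read the file back and validate it before use.
read_labels <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # Both columns must be present.
  stopifnot(all(c("STUDYID", "LABEL") %in% names(df)))
  # Only the two documented label values are allowed.
  stopifnot(all(df$LABEL %in% c("TOXIC", "NON-TOXIC")))
  # Each study should be labelled exactly once.
  stopifnot(!anyDuplicated(df$STUDYID))
  df
}

study_labels <- read_labels(csv_path)
print(table(study_labels$LABEL))
```

Validating the file up front keeps a mistyped label or a duplicated `STUDYID` from surfacing later as a confusing modeling error.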
- Liver Toxicity Score Calculation for Individual `STUDYID`:
  - `f1`: `get_compile_data` - Fetches structured data from the specified database path.
  - `f2`: `get_bw_score` - Calculates body weight z-scores for each animal (depends on `f1`).
  - `f3`: `get_livertobw_zscore` - Computes liver-to-body weight z-scores (depends on `f1`).
  - `f4`: `get_lb_score` - Calculates z-scores for laboratory test results (depends on `f1`).
  - `f5`: `get_mi_score` - Computes z-scores for microscopic findings (depends on `f1`).
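The z-score functions above share one idea: each animal's measurement is expressed relative to a reference distribution. The sketch below illustrates that idea in base R with made-up body weights. It assumes the reference is the control group's mean and standard deviation, which may differ from what `get_bw_score` actually does. The data frame, the `ARM` values, and the helper `zscore_vs_control` are hypothetical.

```r
# Hypothetical per-animal body weights for one study (not real SEND data).
bw <- data.frame(
  USUBJID  = paste0("A", 1:6),
  ARM      = c("control", "control", "control", "high", "high", "high"),
  BWSTRESN = c(300, 310, 290, 250, 260, 270),
  stringsAsFactors = FALSE
)

# Assumption: z-scores are taken relative to the control group's mean and SD.
zscore_vs_control <- function(values, is_control) {
  ctrl_mean <- mean(values[is_control])
  ctrl_sd   <- sd(values[is_control])
  (values - ctrl_mean) / ctrl_sd
}

bw$BW_zscore <- zscore_vs_control(bw$BWSTRESN, bw$ARM == "control")
print(bw)
```

With these numbers the control mean is 300 and the control standard deviation is 10, so the three high-dose animals receive z-scores of -5, -4, and -3.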
- Liver Toxicity Score Calculation and Aggregation for Multiple `STUDYID`:
  - `f6`: `get_liver_om_lb_mi_tox_score_list` - Combines z-scores for LB, MI, and liver-to-BW ratio into a single data frame. Internally calls `f1` to `f5`.
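Aggregating across studies amounts to reducing each study's per-animal scores to one row and stacking the rows. The following base R sketch shows one way this could look. The choice of the column mean as the per-study summary, the score column names, and the helper `summarise_study` are assumptions made for illustration and are not taken from `get_liver_om_lb_mi_tox_score_list`.

```r
# Hypothetical per-animal z-scores for two studies (illustrative values only).
per_study <- list(
  S001 = data.frame(liver_bw_z = c(0.5, 1.5), alt_z = c(2, 4)),
  S002 = data.frame(liver_bw_z = c(0, 1),     alt_z = c(1, 1))
)

# Assumption: each study is summarised by the mean of every score column.
summarise_study <- function(id) {
  means <- colMeans(per_study[[id]])
  data.frame(STUDYID = id, t(means), stringsAsFactors = FALSE)
}

# Stack the one-row summaries into a single data frame.
scores <- do.call(rbind, lapply(names(per_study), summarise_study))
rownames(scores) <- NULL
print(scores)
```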
- Machine Learning Data Preparation:
  - `f7`: `get_col_harmonized_scores_df` - Harmonizes column names in the data frame for consistency (depends on `f6`).
  - `f8`: `get_ml_data_and_tuned_hyperparameters` - Prepares data and tunes hyperparameters for machine learning (depends on `f7`).
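Column harmonization matters because the same finding can appear under slightly different names in different studies, which would otherwise produce separate, mostly empty feature columns. The base R sketch below shows the general technique. The normalization rule (upper case, underscores), the example column names, and the helpers `harmonize_name` and `harmonize_columns` are assumptions and may not match what `get_col_harmonized_scores_df` does.

```r
# Hypothetical score table whose column names vary in case and spacing.
scores <- data.frame(
  STUDYID = c("S001", "S002"),
  `Liver Hypertrophy` = c(2, NA),
  `LIVER_HYPERTROPHY` = c(NA, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Assumption: harmonizing means upper-casing names and replacing runs of
# non-alphanumeric characters with a single underscore.
harmonize_name <- function(x) gsub("[^A-Z0-9]+", "_", toupper(trimws(x)))

# Merge columns that map to the same harmonized name, keeping the first
# non-missing value in each row.
harmonize_columns <- function(df) {
  new_names <- harmonize_name(names(df))
  out <- lapply(unique(new_names), function(nm) {
    cols <- df[new_names == nm]
    Reduce(function(a, b) ifelse(is.na(a), b, a), cols)
  })
  names(out) <- unique(new_names)
  as.data.frame(out, stringsAsFactors = FALSE)
}

harmonized <- harmonize_columns(scores)
print(harmonized)
```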
Machine Learning Model Building and Performance Evaluation:
  - Model Training
    - `f9`: `get_rf_model_with_cv` - Builds a random forest model with cross-validation (depends on `f8`).
  - Improved Classification Accuracy
    - `f10`: `get_zone_exclusioned_rf_model_with_cv` - Enhances classification accuracy by excluding uncertain predictions (depends on `f8`).
  - Feature Importance
    - `f11`: `get_imp_features_from_rf_model_with_cv` - Computes feature importance for model interpretation.
  - Model Performance Visualization
    - `f12`: `get_auc_curve_with_rf_model` - Generates AUC curves to evaluate model performance.
  - Data Preparation
    - For `f9`, `f10`, `f11`, and `f12`, functions `f1` to `f8` must be executed sequentially to prepare the `Data` argument required by these functions.
    - Alternatively, the composite function `f18` can be used to directly generate the `Data` argument, combining the functionality of `f1` to `f8`.
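Two ideas from this group, excluding uncertain predictions (`f10`) and summarizing performance with an AUC (`f12`), can be illustrated without any modeling package. The sketch below uses made-up probabilities and labels in base R. The thresholds `lower` and `upper`, the 0.5 decision cutoff, and the helpers `accuracy` and `auc` are assumptions for illustration. The package itself may define the uncertain zone and compute the AUC differently (for example via `ROCR`).

```r
# Hypothetical predicted probabilities of the TOXIC class and true labels.
prob  <- c(0.95, 0.80, 0.55, 0.45, 0.30, 0.10, 0.65, 0.35)
truth <- c(1,    1,    0,    1,    0,    0,    1,    0)

# Assumption: predictions falling inside [lower, upper] count as uncertain.
lower <- 0.4
upper <- 0.6
certain <- prob < lower | prob > upper

# Fraction of predictions that match the true label at a 0.5 cutoff.
accuracy <- function(p, y) mean((p >= 0.5) == (y == 1))

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve.
auc <- function(p, y) {
  r <- rank(p)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

acc_all     <- accuracy(prob, truth)
acc_certain <- accuracy(prob[certain], truth[certain])
auc_all     <- auc(prob, truth)

cat("Accuracy, all predictions:     ", acc_all, "\n")
cat("Accuracy, uncertain excluded:  ", acc_certain, "\n")
cat("Predictions kept:              ", sum(certain), "of", length(prob), "\n")
cat("AUC, all predictions:          ", auc_all, "\n")
```

The trade-off is visible in the counts: accuracy rises on the retained predictions, but fewer studies receive a prediction at all, so both numbers should be reported together.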
- Combine multiple modular functions for complex operations.
- Visualization and Reporting:
  - `f13`: `get_histogram_barplot` - Creates bar plots for target variable classes (depends on functions `f1` to `f8`).
  - `f14`: `get_reprtree_from_rf_model` - Builds representative decision trees (depends on functions `f1` to `f8`).
  - `f15`: `get_prediction_plot` - Visualizes prediction probabilities with histograms (depends on functions `f1` to `f8`).
- `f16`: `get_Data_formatted_for_ml_and_best.m`
  - Creates machine learning-ready data by executing `f1` to `f8`.
  - Formats data for ML pipelines.
  - Provides the same result as `f8` by merging the functionality of functions `f1` to `f7`.
- `f17`: `get_rf_input_param_list_output_cv_imp`
  - Automates pre-processing, modeling, and evaluation.
  - Provides the same result as `f9` by merging the functionality of functions `f1` to `f8`.
- `f18`: `get_zone_exclusioned_rf_model_cv_imp`
  - Similar to `f17` but excludes uncertain predictions.
  - Provides the same result as `f10` by merging the functionality of functions `f1` to `f8`.
  - Optional argument for hyperparameter tuning.
- `h1`: `get_treatment_group_&_dose` - Retrieves treatment groups from the `tx` domain.
- `h2`: `get_repeat_dose_parallel_studyids`
  - Retrieves `STUDYID`s for repeat-dose and parallel study designs.
  - Optional filtering for "rat" species studies.
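The study selection described for `h2` can be pictured as a filter over the SEND trial summary (TS) domain. The sketch below is a base R illustration on a made-up table. It assumes study type, design, and species are stored under the `TSPARMCD` codes `SSTYP`, `SDESIGN`, and `SPECIES`. The matching patterns and the helpers `ids_with` and `select_studyids` are hypothetical and are not the package's implementation.

```r
# Hypothetical extract of a SEND trial summary (TS) table.
ts <- data.frame(
  STUDYID  = c("S001", "S001", "S001", "S002", "S002", "S002"),
  TSPARMCD = c("SSTYP", "SDESIGN", "SPECIES", "SSTYP", "SDESIGN", "SPECIES"),
  TSVAL    = c("REPEAT DOSE TOXICITY", "PARALLEL", "RAT",
               "REPEAT DOSE TOXICITY", "PARALLEL", "DOG"),
  stringsAsFactors = FALSE
)

# Study IDs whose value for one TS parameter matches a pattern.
ids_with <- function(ts, parmcd, pattern) {
  hit <- ts$TSPARMCD == parmcd & grepl(pattern, ts$TSVAL, ignore.case = TRUE)
  unique(ts$STUDYID[hit])
}

# Keep repeat-dose, parallel-design studies; optionally restrict to rats.
select_studyids <- function(ts, rat_only = FALSE) {
  ids <- intersect(ids_with(ts, "SSTYP", "REPEAT DOSE"),
                   ids_with(ts, "SDESIGN", "PARALLEL"))
  if (rat_only) ids <- intersect(ids, ids_with(ts, "SPECIES", "RAT"))
  ids
}

all_ids <- select_studyids(ts)
rat_ids <- select_studyids(ts, rat_only = TRUE)
print(all_ids)
print(rat_ids)
```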
- `fid1`: `get_all_LB_TESTCD_score` - Calculates scores for each `LBTESTCD` based on `get_lb_score`.
- `fid2`: `get_indiv_score_om_lb_mi_domain_df` - Returns domain-specific scores including liver-to-BW ratio, LB, and MI scores.
- `randomForest`
- `ROCR`
- `ggplot2`
- `reprtree`
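```r
# Install from GitHub (requires the devtools package)
devtools::install_github("aminuldu07/SENDQSAR")
library(SENDQSAR)

# Compile study data from the database
data <- get_compile_data("/path/to/database")

# Compute body weight and liver-to-body weight z-scores
bw_scores <- get_bw_score(data)
liver_scores <- get_livertobw_zscore(data)

# Train a random forest model with cross-validation and inspect the result
model <- get_rf_model_with_cv(data, n_repeats=10)
print(model$confusion_matrix)

# Plot the distribution of the target variable
get_histogram_barplot(data, target_col="target_variable")
```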
Contributions are welcome! Feel free to submit issues or pull requests via GitHub.
This project is licensed under the MIT License - see the LICENSE file for details.
For more information, visit the project GitHub Page or contact [email protected].