diff --git a/R/get_7col_harmonized_scores_df.R b/R/get_col_harmonized_scores_df.R similarity index 71% rename from R/get_7col_harmonized_scores_df.R rename to R/get_col_harmonized_scores_df.R index 3571aa3..1b991f6 100644 --- a/R/get_7col_harmonized_scores_df.R +++ b/R/get_col_harmonized_scores_df.R @@ -1,20 +1,36 @@ - -#' @title get_random_forest_model -#' @param data_frame Mandatory, character \cr -#' Studyid number -#' @param Liver_get_liver_om_lb_mi_tox_score_list Mandatory, character \cr -#' Studyid number -#' @param not_Liver_get_liver_om_lb_mi_tox_score_list Mandatory, character \cr -#' path of database -#' @return score +#' @title get_col_harmonized_scores_df +#' @description +#' This function harmonizes liver score data by cleaning column names, +#' replacing missing values with zeros, and optionally rounding specific columns. +#' The function also identifies and harmonizes synonyms, removes unnecessary columns, +#' and reorders the data based on column sums. +#' +#' @param liver_score_data_frame A data frame containing liver score data. +#' This data frame should have column names that may require harmonization. +#' @param Round A logical value indicating whether the data should be rounded. +#' If TRUE, certain liver-related columns are floored and capped, and histology-related columns are ceiled. Default is FALSE. +#' +#' @details +#' The function performs the following operations: +#' - Harmonizes column names by replacing spaces, commas, and slashes with dots. +#' - Replaces missing values (NA) with zero. +#' - Identifies and harmonizes synonym columns, replacing their values with the higher value between the synonyms. +#' - Removes specific unwanted columns such as 'INFILTRATE', 'UNREMARKABLE', 'THIKENING', and 'POSITIVE'. +#' - Optionally rounds liver score columns by flooring and capping them at 5, and histology-related columns by ceiling. +#' - Reorders columns based on the sum of their values. 
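The cleaning steps listed above can be sketched in a few lines of base R. This is an illustrative sketch on invented toy data (`harmonize_cols_sketch`, the `Serum ALT` column, and the values are assumptions for demonstration), not the package's actual implementation:

```r
# Illustrative sketch of the name cleaning and NA handling described above
# (toy data and helper name are assumptions, not the package's code).
harmonize_cols_sketch <- function(df) {
  # Replace spaces, commas, and slashes in column names with dots
  names(df) <- gsub("[ ,/]", ".", names(df))
  # Replace missing values (NA) with zero
  df[is.na(df)] <- 0
  df
}

scores <- data.frame(
  STUDYID = c("S1", "S2"),
  `Serum ALT` = c(1.2, NA),
  check.names = FALSE
)
harmonize_cols_sketch(scores)
```

After harmonization, `Serum ALT` becomes `Serum.ALT` and the missing value is zero-filled.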
+#'
+#' @return A data frame with harmonized liver scores, optional rounding, and columns reordered based on their sums.
 #'
 #' @examples
 #' \dontrun{
-#' get_liver_lb_score(studyid='1234123', database_path = dbtoken)
+#' # Example usage
+#' result <- get_col_harmonized_scores_df(liver_score_data_frame = liver_scores, Round = TRUE)
 #' }
-#' @export
 #'
-
+
+#' @export
 get_col_harmonized_scores_df <- function(liver_score_data_frame,
diff --git a/R/get_8ml_data_and_tuned_hyperparameters.R b/R/get_ml_data_and_tuned_hyperparameters.R
similarity index 82%
rename from R/get_8ml_data_and_tuned_hyperparameters.R
rename to R/get_ml_data_and_tuned_hyperparameters.R
index c86e1b6..41d3daf 100644
--- a/R/get_8ml_data_and_tuned_hyperparameters.R
+++ b/R/get_ml_data_and_tuned_hyperparameters.R
@@ -1,40 +1,47 @@
-#' Get Random Forest Data and Best Model
+#' @title Get Random Forest Data and Tuned Hyperparameters
 #'
-#' This function retrieves and processes data for random forest analysis from a SQLite database.
-#' It performs the following steps:
-#' 1. Connects to the SQLite database and retrieves unique `STUDYID` values from the `dm` table.
-#' 2. Generates liver toxicity scores for the given study IDs.
-#' 3. Harmonizes the columns in the scores data frame.
-#' 4. Reads metadata for study IDs.
-#' 5. Prepares the data and tunes hyperparameters for a random forest model.
+#' @description
+#' The `get_ml_data_and_tuned_hyperparameters` function processes input data and metadata to prepare data for
+#' random forest analysis. It includes steps for data preprocessing, optional imputation, rounding,
+#' error correction, and hyperparameter tuning.
 #'
-#' @param path_db Character. Path to the SQLite database.
-#' @param studyid_metadata_path Character. Path to the CSV file containing metadata for study IDs.
-#' @param fake_study Logical.
Whether to use fake study IDs. Default is `TRUE`. -#' @param use_xpt_file Logical. Whether to use XPT file format. Default is `FALSE`. -#' @param output_individual_scores Logical. Whether to output individual scores. Default is `TRUE`. -#' @param output_zscore_by_USUBJID Logical. Whether to output z-scores by `USUBJID`. Default is `FALSE`. -#' @param Impute Logical. Whether to impute missing values in the data. Default is `TRUE`. -#' @param Round Logical. Whether to round numerical values in the data. Default is `TRUE`. -#' @param reps Integer. Number of repetitions for model evaluation. Default is `1`. -#' @param holdback Numeric. Proportion of data to hold back for validation. Default is `0.75`. -#' @param Undersample Logical. Whether to perform undersampling to balance the data. Default is `TRUE`. -#' @param hyperparameter_tuning Logical. Whether to perform hyperparameter tuning. Default is `FALSE`. -#' @param error_correction_method Character. Method for error correction. Default is `'None'`. +#' @param Data data.frame. Input data frame containing scores, typically named `scores_df`. +#' @param studyid_metadata data.frame. Metadata containing `STUDYID` values, used for joining with `Data`. +#' @param Impute logical. Indicates whether to impute missing values in the dataset using random forest imputation. Default is `FALSE`. +#' @param Round logical. Specifies whether to round specific numerical columns according to predefined rules. Default is `FALSE`. +#' @param reps integer. Number of repetitions for cross-validation. A value of `0` skips repetition. +#' @param holdback numeric. Fraction of data to hold back for testing. A value of `1` performs leave-one-out cross-validation. +#' @param Undersample logical. Indicates whether to undersample the training data to balance the target classes. Default is `FALSE`. +#' @param hyperparameter_tuning logical. Specifies whether to perform hyperparameter tuning for the random forest model. Default is `FALSE`. 
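The `holdback` and `Undersample` behaviour described by the parameters above can be sketched as follows. The data frame, class proportions, and 25% holdback are invented for illustration; this is not the package's exact sampling scheme:

```r
# Sketch of a holdback split followed by majority-class undersampling
# (toy data; not the package's implementation).
set.seed(42)
df <- data.frame(Target_Organ = rep(c(1, 0), times = c(30, 70)), x = rnorm(100))

holdback <- 0.25
test_idx <- sample(nrow(df), size = floor(holdback * nrow(df)))
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# Undersample so both target classes appear equally often in the training set
n_min <- min(table(train$Target_Organ))
train_bal <- do.call(rbind, lapply(split(train, train$Target_Organ),
                                   function(g) g[sample(nrow(g), n_min), ]))
table(train_bal$Target_Organ)
```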
+#' @param error_correction_method character. Specifies the method for error correction. Can be `"Flip"`, `"Prune"`, or `NULL`. Default is `NULL`. +#' +#' @return +#' A list containing: +#' \describe{ +#' \item{rfData}{The final processed data after preprocessing and error correction.} +#' \item{best.m}{The best `mtry` hyperparameter determined for the random forest model.} +#' } #' -#' @return A list containing the processed data and the best model parameters. #' @export #' #' @examples -#' path_db <- "C:/path/to/database.db" -#' studyid_metadata_path <- "C:/path/to/study_metadata.csv" -#' rfData_and_best_m <- get_rfData_and_best_m( -#' path_db = path_db, -#' studyid_metadata_path = studyid_metadata_path, -#' fake_study = TRUE, +#' # Example usage: +#' Data <- scores_df +#' studyid_metadata <- read.csv("path/to/study_metadata.csv") +#' result <- get_ml_data_and_tuned_hyperparameters( +#' Data = Data, +#' studyid_metadata = studyid_metadata, +#' Impute = TRUE, #' Round = TRUE, -#' Undersample = TRUE +#' reps = 10, +#' holdback = 0.75, +#' Undersample = TRUE, +#' hyperparameter_tuning = TRUE, +#' error_correction_method = "Flip" #' ) +#' rfData <- result$rfData +#' best_mtry <- result$best.m + get_ml_data_and_tuned_hyperparameters <- function(Data, # Data == "scores_df" studyid_metadata, diff --git a/R/get_9rf_model_with_cv.R b/R/get_rf_model_with_cv.R similarity index 79% rename from R/get_9rf_model_with_cv.R rename to R/get_rf_model_with_cv.R index 9120638..949126d 100644 --- a/R/get_9rf_model_with_cv.R +++ b/R/get_rf_model_with_cv.R @@ -1,3 +1,55 @@ +#' @title Random Forest with Cross-Validation +#' +#' @description +#' This function builds a random forest model using the `randomForest` package, evaluates it through cross-validation, +#' and computes performance metrics such as sensitivity, specificity, and accuracy. +#' It optionally applies undersampling to handle class imbalance and supports custom settings for the number of predictors sampled at each split. 
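The repeated train/test evaluation described above can be sketched like this. The 70/30 split and the trivial majority-class rule are assumptions for illustration only; the real function fits a random forest on each split:

```r
# Sketch of testReps repeated splits with aggregated accuracy
# (a majority-class rule stands in for the random forest model).
set.seed(7)
df <- data.frame(Target_Organ = rep(c(1, 0), 50), x = rnorm(100))
testReps <- 5
acc <- numeric(testReps)
for (i in seq_len(testReps)) {
  idx <- sample(nrow(df), size = floor(0.3 * nrow(df)))
  train <- df[-idx, ]
  test  <- df[idx, ]
  # stand-in "model": always predict the most frequent training class
  majority <- as.integer(names(which.max(table(train$Target_Organ))))
  acc[i] <- mean(test$Target_Organ == majority)
}
mean(acc)  # aggregated accuracy across repetitions
```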
+#' +#' @param Data Mandatory, data frame +#' The input dataset, which must include a column named `Target_Organ` as the response variable. +#' @param Undersample Optional, logical +#' If `TRUE`, balances the dataset by undersampling the majority class. Default is `FALSE`. +#' @param best.m Optional, numeric or `NULL` +#' Specifies the number of predictors sampled at each split. If `NULL`, the default value of `randomForest` is used. +#' @param testReps Mandatory, integer +#' The number of cross-validation repetitions. Must be at least 2. +#' @param Type Mandatory, numeric +#' Specifies the importance metric type: `1` for Mean Decrease Accuracy or `2` for Gini. +#' +#' @return +#' A list with the following elements: +#' \itemize{ +#' \item \code{performance_metrics}: A vector of aggregated performance metrics, including sensitivity, specificity, and accuracy. +#' \item \code{raw_results}: A list containing raw sensitivity, specificity, and accuracy values for each cross-validation fold. +#' } +#' +#' @details +#' This function splits the input data into training and testing subsets based on the specified `testReps` cross-validation folds. +#' If undersampling is enabled, the function balances the training set to reduce class imbalance. +#' A random forest model is trained on the training set, and predictions are evaluated on the test set. The results are aggregated to provide summary performance metrics. 
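The per-fold metrics named above reduce to confusion-matrix counts. A base-R sketch with toy label vectors (not package code):

```r
# Sensitivity, specificity, and accuracy from predicted vs. true labels
truth <- c(1, 1, 0, 0, 1, 0)
pred  <- c(1, 0, 0, 0, 1, 1)

tp <- sum(pred == 1 & truth == 1)  # true positives
fn <- sum(pred == 0 & truth == 1)  # false negatives
tn <- sum(pred == 0 & truth == 0)  # true negatives
fp <- sum(pred == 1 & truth == 0)  # false positives

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / length(truth)
```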
+#' +#' @examples +#' # Load necessary libraries +#' library(randomForest) +#' library(caret) +#' +#' # Example dataset +#' data(iris) +#' iris$Target_Organ <- ifelse(iris$Species == "setosa", 1, 0) +#' iris <- iris[, -5] # Remove Species column +#' +#' # Run the function +#' results <- get_rf_model_with_cv(Data = iris, +#' Undersample = TRUE, +#' best.m = 2, +#' testReps = 5, +#' Type = 2) +#' +#' # Print results +#' print(results$performance_metrics) +#' +#' @export + get_rf_model_with_cv <- function(Data, Undersample = FALSE, @@ -195,6 +247,17 @@ print("......................................................................... } + + + + + + + + + + + # #Add a special case for testReps = 1 that directly splits data into train and test sets without looping or iterative sampling. # if (testReps == 1) { # # Use a single random split (e.g., 70% train, 30% test) diff --git a/R/get_10zone_exclusioned_rf_model_with_cv.R b/R/get_zone_exclusioned_rf_model_with_cv.R similarity index 72% rename from R/get_10zone_exclusioned_rf_model_with_cv.R rename to R/get_zone_exclusioned_rf_model_with_cv.R index 6196177..b01cd72 100644 --- a/R/get_10zone_exclusioned_rf_model_with_cv.R +++ b/R/get_zone_exclusioned_rf_model_with_cv.R @@ -1,3 +1,67 @@ +#' @title Random Forest Model with Cross-validation and Exclusion +#' @description This function implements a Random Forest classification model +#' with cross-validation and allows for undersampling, handling indeterminate +#' predictions, and calculating various model performance metrics such as +#' sensitivity, specificity, and accuracy. It tracks the proportion of indeterminate +#' predictions and provides an aggregated performance summary across multiple test repetitions. +#' +#' @param Data A data frame containing the features and the target variable `Target_Organ` +#' to train the Random Forest model on. 
+#'
+#' @param Undersample A logical value indicating whether to perform undersampling to
+#' balance the classes in the training data. Defaults to `FALSE`.
+#'
+#' @param best.m A numeric value representing the best number of variables (`mtry`)
+#' to use at each split in the Random Forest model. This can be manually set or
+#' determined through optimization.
+#'
+#' @param testReps An integer specifying the number of test repetitions. This must
+#' be at least 2, as the function relies on multiple test sets to assess the model performance.
+#'
+#' @param indeterminateUpper A numeric value giving the upper bound of the
+#' indeterminate zone. Predictions with probabilities between the lower and
+#' upper bounds are marked as indeterminate.
+#'
+#' @param indeterminateLower A numeric value giving the lower bound of the
+#' indeterminate zone. Predictions with probabilities between the lower and
+#' upper bounds are marked as indeterminate.
+#'
+#' @param Type An integer indicating the type of feature importance to use in the
+#' Random Forest model. Typically, `1` for "Mean Decrease Accuracy" or `2` for "Mean Decrease Gini".
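The indeterminate zone defined by the two bounds can be sketched as follows; the probabilities and thresholds are invented for illustration:

```r
# Predictions whose probability falls between the bounds are excluded (NA)
prob <- c(0.05, 0.35, 0.55, 0.85, 0.95)
indeterminateLower <- 0.25
indeterminateUpper <- 0.75

pred <- ifelse(prob > indeterminateUpper, 1,
               ifelse(prob < indeterminateLower, 0, NA))
prop_indeterminate <- mean(is.na(pred))  # proportion of excluded predictions
```

Here two of the five predictions fall inside the zone and are excluded from the performance metrics.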
+#' +#' @return A list containing two components: +#' \describe{ +#' \item{performance_metrics}{A vector with the aggregated performance metrics, +#' including sensitivity, specificity, accuracy, and others, calculated across +#' all test repetitions.} +#' \item{raw_results}{A list containing the raw performance metrics for each repetition, +#' including sensitivity, specificity, and accuracy.} +#' } +#' +#' @examples +#' \dontrun{ +#' # Example usage +#' Data <- your_data_frame # Replace with actual dataset +#' results <- get_zone_exclusioned_rf_model_with_cv(Data = Data, +#' Undersample = TRUE, +#' best.m = 5, +#' testReps = 10, +#' indeterminateUpper = 0.8, +#' indeterminateLower = 0.2, +#' Type = 1) +#' +#' # View the aggregated performance metrics +#' print(results$performance_metrics) +#' +#' # Access raw results for further analysis +#' print(results$raw_results) +#' } +#' +#' @seealso \link[randomForest]{randomForest}, \link[caret]{confusionMatrix} +#' +#' @import randomForest +#' @import caret + get_zone_exclusioned_rf_model_with_cv <- function(Data=NULL, #scores_df Undersample = FALSE, @@ -6,7 +70,7 @@ get_zone_exclusioned_rf_model_with_cv <- function(Data=NULL, #scores_df indeterminateUpper, indeterminateLower, Type) { -browser() + rfData <- Data #rfData <- scores_df #--------------------------------------------------------------------- # Initialize model performance metric trackers------------------------ diff --git a/inst/UntitledR.R b/inst/UntitledR.R new file mode 100644 index 0000000..90a1d14 --- /dev/null +++ b/inst/UntitledR.R @@ -0,0 +1,362 @@ +get_liver_om_lb_mi_tox_score_list <- function (studyid_or_studyids = FALSE, + path_db, + fake_study = FALSE, + use_xpt_file = FALSE, + # multiple_xpt_folder = FALSE, + output_individual_scores = FALSE, + output_zscore_by_USUBJID = FALSE) { + + # "multiple_xpt_folder" argument control the studyid/xpt folder directory + # Enforce mutual exclusivity: If both are TRUE, throw an error + if 
(output_individual_scores && output_zscore_by_USUBJID) {
+    stop("Error: Both 'output_individual_scores' and 'output_zscore_by_USUBJID' cannot be TRUE at the same time.")
+  }
+  if (output_individual_scores) {
+    # initialize the data frames and list containers this mode requires
+  } else if (output_zscore_by_USUBJID) {
+    # initialize the data frames and list containers this mode requires
+  } else {
+    # initialize the data frames and list containers this mode requires
+  }
+  # iterate over each studyid or xpt folder
+  #for (studyid in selected_studies){
+  for (studyid in studyid_or_studyids) {
+
+    #if( multiple_xpt_folder == TRUE) {
+    if (use_xpt_file == TRUE) {
+
+      path_db <- studyid # giving the path of the xpt folder
+
+      print(path_db)
+    }
+
+    # Initialize a flag variable at the start of each iteration
+    first_block_success <- TRUE
+
+    # First Block with its own tryCatch for master_compiledata~~~~~~~~~~~~~~~~~~
+    tryCatch({
+
+      # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value.
+      studyid <- if (use_xpt_file) NULL else studyid
+
+      # if use_xpt_file = TRUE, studyid should be NULL
+      # Call get_compile_data() to build the master_compiledata data frame
+      output_get_compile_data <- get_compile_data(studyid = studyid,
+                                                  path_db = path_db,
+                                                  fake_study = fake_study,
+                                                  use_xpt_file = use_xpt_file)
+
+      # Get the "master_compiledata" data frame from the output
+      master_compiledata <- output_get_compile_data
+
+      # Create a copy of master_compiledata for diagnostic purposes
+      master_compiledata_copy <- master_compiledata
+    },
+    # Handle the errors
+    error = function(e) {
+      master_error_df <<- rbind(master_error_df, error_block1)
+
+      # Set the flag to FALSE to indicate the first block failed
+      first_block_success <<- FALSE
+
+    })
+
+    # Check the flag to decide whether to proceed to the next iteration of the loop
+    if (!first_block_success) {
+
+      # Append STUDYID to the error_studies list
+      Error_studies <- c(Error_studies, studyid)
+
+      next
+    }
+
+    #-----------------end of master_compiledata calculation----------------------
+    # This block adds a new row for the current STUDYID in FOUR_Liver_Score
+    tryCatch({
+
+
+      if (!output_individual_scores && !output_zscore_by_USUBJID) {
+        new_row_in_four_liver_scr_avg <- data.frame(STUDYID = unique(master_compiledata$STUDYID),
+                                                    BWZSCORE_avg = NA,
+                                                    liverToBW_avg = NA,
+                                                    LB_score_avg = NA,
+                                                    MI_score_avg = NA)
+        FOUR_Liver_Score_avg <- rbind(FOUR_Liver_Score_avg, new_row_in_four_liver_scr_avg)
+      }
+
+    }, error = function(e) { # handle errors properly
+
+    })
+    #-----------------END of score_accumulation_df--------------------------------
+    #------------------Calculation of BodyWeight_zScore---------------------------
+
+    tryCatch({
+      if (output_individual_scores) {
+        # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value.
+ studyid <- if (use_xpt_file) NULL else studyid + + bwzscore_BW <- get_bw_score (studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = TRUE, + return_zscore_by_USUBJID = FALSE) + + } else if (output_zscore_by_USUBJID) { + + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + BW_zscore_by_USUBJID_HD <-get_bw_score(studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = FALSE, + return_zscore_by_USUBJID = TRUE) + + BW_zscore_by_USUBJID_HD <- as.data.frame(BW_zscore_by_USUBJID_HD) + + } else { + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + + + averaged_HD_BWzScore <- get_bw_score (studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata , + return_individual_scores = FALSE, + return_zscore_by_USUBJID = FALSE) + print(averaged_HD_BWzScore) + # # Add the liverToBW_zscore to "FOUR_Liver_Score" data frame............ 
+
+        # Extract the BW z-score for the current STUDYID from averaged_HD_BWzScore
+        calculated_BWzScore_value <- averaged_HD_BWzScore$BWZSCORE_avg[averaged_HD_BWzScore$STUDYID == unique(master_compiledata$STUDYID)]
+        #calculated_liverToBW_value <- liverToBW_df$liverToBW_avg
+
+        # Update the BWZSCORE_avg value in FOUR_Liver_Score_avg for the current STUDYID
+        FOUR_Liver_Score_avg$BWZSCORE_avg[FOUR_Liver_Score_avg$STUDYID == unique(master_compiledata$STUDYID)] <- calculated_BWzScore_value
+
+      }
+    }, error = function(e) {
+      # handle errors properly
+
+    })
+    #---------------------------"OM_DATA" (Liver Organ to Body Weight zScore)------
+    tryCatch({
+      # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value.
+      studyid <- if (use_xpt_file) NULL else studyid
+
+      if (output_individual_scores) {
+        # when output_individual_scores == TRUE, bwzscore_BW has already been
+        # calculated, so we don't need to recalculate it here
+        HD_liver_zscore_df <- get_livertobw_score(studyid = studyid,
+                                                  path_db = path_db,
+                                                  fake_study = fake_study,
+                                                  use_xpt_file = use_xpt_file,
+                                                  master_compiledata = master_compiledata,
+                                                  bwzscore_BW = bwzscore_BW,
+                                                  return_individual_scores = TRUE,
+                                                  return_zscore_by_USUBJID = FALSE)
+
+        HD_liver_zscore_df <- as.data.frame(HD_liver_zscore_df)
+        master_liverToBW <- rbind(master_liverToBW, HD_liver_zscore_df)
+
+      } else if (output_zscore_by_USUBJID) {
+
+        # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value.
+ studyid <- if (use_xpt_file) NULL else studyid + + bwzscore_BW <- get_bw_score (studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = TRUE, + return_zscore_by_USUBJID = FALSE) + liverTOBW_zscore_by_USUBJID_HD <- get_livertobw_score (studyid = studyid, + path_db = path_db , + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + bwzscore_BW = bwzscore_BW, + return_individual_scores = FALSE, + return_zscore_by_USUBJID = TRUE) + + liverTOBW_zscore_by_USUBJID_HD <- as.data.frame(liverTOBW_zscore_by_USUBJID_HD) + liverTOBW_study_identifier <- unique(liverTOBW_zscore_by_USUBJID_HD$STUDYID) + # Use the study_identifier as the list index + master_liverToBW[[as.character(liverTOBW_study_identifier)]] <- liverTOBW_zscore_by_USUBJID_HD + + } else { + + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + + # if (is.null(bwzscore_BW)) { + bwzscore_BW <- get_bw_score (studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = TRUE, + return_zscore_by_USUBJID = FALSE) + + averaged_liverToBW_df <- get_livertobw_score (studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + bwzscore_BW = bwzscore_BW , + return_individual_scores = FALSE, + return_zscore_by_USUBJID = FALSE) + + # Create "liverToBW_df" for FOUR_Liver_Score_avg + liverToBW_df <- averaged_liverToBW_df %>% + dplyr::rename(liverToBW_avg = avg_liverToBW_zscore) + + # Extract the liverToBW value for the current STUDYID from liverToBW_df + calculated_liverToBW_value <- liverToBW_df$liverToBW_avg[liverToBW_df$STUDYID == unique(master_compiledata$STUDYID)] + #calculated_liverToBW_value <- 
liverToBW_df$liverToBW_avg + # Update the liverToBW value in FOUR_Liver_Score_avg for the current STUDYID + print(calculated_liverToBW_value) + FOUR_Liver_Score_avg$liverToBW_avg[FOUR_Liver_Score_avg$STUDYID == unique(master_compiledata$STUDYID)] <- calculated_liverToBW_value + + } + }, error = function(e) { + #handle error properly + }) + + #<><><><><><><><><><><><><><><><><><>"""LB"""" zscoring <><><><><><><><><><><> + tryCatch({ + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + if(output_individual_scores){ + master_lb_scores <- get_lb_score(studyid = studyid, + path_db = path_db, + fake_study= fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = TRUE, + return_zscore_by_USUBJID = FALSE) + + master_lb_score_six <- rbind(master_lb_score_six , master_lb_scores) + + } else if (output_zscore_by_USUBJID) { + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + LB_zscore_by_USUBJID_HD <- get_lb_score(studyid = studyid, + path_db = path_db, + fake_study= fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata, + return_individual_scores = FALSE, + return_zscore_by_USUBJID = TRUE) + + lb_study_identifier <- unique(LB_zscore_by_USUBJID_HD$STUDYID) + # append to the master data frame list + # Use the study_identifier as the list index + master_lb_score[[as.character(lb_study_identifier)]] <- LB_zscore_by_USUBJID_HD + } else { + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. 
+ studyid <- if (use_xpt_file) NULL else studyid + + averaged_LB_score <- get_lb_score(studyid = studyid, + path_db = path_db, + fake_study= fake_study, + use_xpt_file = use_xpt_file , + master_compiledata = master_compiledata, + return_individual_scores = FALSE, + return_zscore_by_USUBJID = FALSE) + # Extract the LB_score value for the current STUDYID from LB_df + calculated_LB_value <- averaged_LB_score$LB_score_avg[ averaged_LB_score$STUDYID == unique(master_compiledata$STUDYID)] + + # Update the LB_score value in FOUR_Liver_Score for the current STUDYID + FOUR_Liver_Score_avg$LB_score_avg[FOUR_Liver_Score_avg$STUDYID == unique(master_compiledata$STUDYID)] <- calculated_LB_value + + } + }, error = function(e) { + # handle error properly + }) + #<><><><><><><><><><><><><><><><><><>"""MI"""" zscoring <><><><><><><><><><><> + tryCatch({ + + if(output_individual_scores ){ + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. + studyid <- if (use_xpt_file) NULL else studyid + mi_score_final_list_df <- get_mi_score(studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata , + return_individual_scores = TRUE, + return_zscore_by_USUBJID = FALSE) + master_mi_df <- dplyr::bind_rows(master_mi_df, mi_score_final_list_df) + + } else if (output_zscore_by_USUBJID) { + studyid <- if (use_xpt_file) NULL else studyid + MI_score_by_USUBJID_HD <-get_mi_score(studyid = studyid, + path_db = path_db, + fake_study = fake_study, + use_xpt_file = use_xpt_file, + master_compiledata = master_compiledata , + return_individual_scores = FALSE, + return_zscore_by_USUBJID = TRUE) + + mi_study_identifier <- unique(MI_score_by_USUBJID_HD$STUDYID) + master_mi_score[[as.character(mi_study_identifier)]] <- MI_score_by_USUBJID_HD + + } else{ + # Set 'studyid' to NULL if using an XPT file, otherwise keep the original value. 
+
+        studyid <- if (use_xpt_file) NULL else studyid
+
+        averaged_MI_score <- get_mi_score(studyid = studyid,
+                                          path_db = path_db,
+                                          fake_study = fake_study,
+                                          use_xpt_file = use_xpt_file,
+                                          master_compiledata = master_compiledata,
+                                          return_individual_scores = FALSE,
+                                          return_zscore_by_USUBJID = FALSE)
+        # Extract the "MI_score_avg" value for the current STUDYID
+        calculated_MI_value <- averaged_MI_score$MI_score_avg[averaged_MI_score$STUDYID == unique(master_compiledata$STUDYID)]
+
+        # Update the MI_score_avg value in FOUR_Liver_Score_avg for the current STUDYID
+        FOUR_Liver_Score_avg$MI_score_avg[FOUR_Liver_Score_avg$STUDYID == unique(master_compiledata$STUDYID)] <- calculated_MI_value
+
+      }
+    }, error = function(e) {
+      # handle errors properly
+    })
+  }
+  if (output_individual_scores) {
+    # Perform the merge using full_join to keep all rows from each data frame
+    combined_output_individual_scores <- master_liverToBW %>%
+      dplyr::full_join(master_lb_score_six, by = "STUDYID") %>%
+      dplyr::full_join(master_mi_df, by = "STUDYID")
+  } else if (output_zscore_by_USUBJID) {
+    combined_liverToBW <- dplyr::bind_rows(master_liverToBW)
+    combined_lb_score <- dplyr::bind_rows(master_lb_score)
+    combined_mi_score <- dplyr::bind_rows(master_mi_score)
+    # Merge the first two data frames (liverToBW and LB) on STUDYID and USUBJID
+    combined_df <- combined_liverToBW %>%
+      dplyr::full_join(combined_lb_score, by = c("STUDYID", "USUBJID"))
+
+    # Merge the result with the third data frame (MI) on STUDYID and USUBJID
+    final_output_zscore_by_USUBJID <- combined_df %>%
+      dplyr::full_join(combined_mi_score, by = c("STUDYID", "USUBJID"))
+  } else {
+    # Round all columns from the second column onward to two decimal places
+    FOUR_Liver_Score_avg[, 2:ncol(FOUR_Liver_Score_avg)] <- round(FOUR_Liver_Score_avg[, 2:ncol(FOUR_Liver_Score_avg)], 2)
+  }
+  if (output_individual_scores) {
+    return(combined_output_individual_scores)
+  } else
if(output_zscore_by_USUBJID) { + return(final_output_zscore_by_USUBJID) + } else { + return(FOUR_Liver_Score_avg) + } +} diff --git a/vignettes/get_col_harmonized_scores_df.Rmd b/vignettes/get_col_harmonized_scores_df.Rmd new file mode 100644 index 0000000..376a21e --- /dev/null +++ b/vignettes/get_col_harmonized_scores_df.Rmd @@ -0,0 +1,62 @@ +--- +title: "Function Documentation: get_col_harmonized_scores_df" +output: html_document +--- + +## Function: `get_col_harmonized_scores_df` + +### Description + +This function takes a data frame containing liver score data, harmonizes the column names, handles missing values, and performs optional rounding of specific score columns. It aims to standardize and clean the data for further analysis by: +- Replacing spaces, commas, and slashes in column names with dots. +- Handling missing values by replacing them with zero. +- Harmonizing columns with similar meanings (synonyms). +- Removing unwanted columns. +- Optionally rounding columns related to liver scores and histology scores. + +### Parameters + +- `liver_score_data_frame` (data.frame): A data frame containing liver score data with column names that may need harmonization. +- `Round` (logical, default = FALSE): If `TRUE`, the function will round the values in certain columns based on specific rules. + +### Details + +1. **Column Harmonization**: + - Spaces, commas, and slashes in column names are replaced with dots. + - Missing values (NA) are replaced with zeros. + +2. **Synonym Harmonization**: + - Columns with similar meanings (synonyms) are identified and harmonized by replacing their values with the higher value between them. + - Specific columns such as 'STUDYID', 'UNREMARKABLE', 'THIKENING', and 'POSITIVE' are excluded from harmonization. + +3. **Optional Rounding**: + - If `Round` is set to `TRUE`, the function rounds certain columns: + - Liver-related columns (`avg_`, `liver`) are floored to the nearest integer and capped at 5. 
+   - Histology-related columns are ceiled to the nearest integer.
+
+4. **Column Reordering**:
+   - Columns are reordered based on the sum of their values (excluding the first column).
+   - Columns with higher sums are moved to the left, ensuring that the most "important" columns appear first.
+
+5. **Column Removal**:
+   - Columns related to specific endpoints (e.g., 'INFILTRATE', 'UNREMARKABLE', 'THIKENING', 'POSITIVE') are removed from the final data frame.
+
+### Return Value
+
+- A data frame with harmonized columns, optional rounding applied, and columns ordered based on the sum of their values.
+
+### Example Usage
+
+```r
+# Sample liver score data frame
+liver_scores <- data.frame(
+  STUDYID = c(1, 2, 3),
+  INFILTRATE = c(0, 1, 0),
+  avg_Liver = c(3.5, 4.2, 2.1),
+  POSITIVE = c(0, 0, 1),
+  `Thickening` = c(0, 0, 1),
+  Liver_to_BW_zscore = c(3, 2, 4)
+)
+
+# Call the function with Round = TRUE
+result <- get_col_harmonized_scores_df(liver_score_data_frame = liver_scores, Round = TRUE)
+```
diff --git a/vignettes/get_ml_data_and_tuned_hyperparameters.Rmd b/vignettes/get_ml_data_and_tuned_hyperparameters.Rmd
new file mode 100644
index 0000000..0ff6c48
--- /dev/null
+++ b/vignettes/get_ml_data_and_tuned_hyperparameters.Rmd
@@ -0,0 +1,120 @@
+---
+title: "Documentation for `get_ml_data_and_tuned_hyperparameters` Function"
+output: html_document
+---
+
+## Overview
+
+The `get_ml_data_and_tuned_hyperparameters` function processes and prepares machine learning data for modeling, with various optional preprocessing steps such as missing value imputation, undersampling, and hyperparameter tuning. It also supports error correction via specific methods like "Flip" and "Prune".
+
+## Function Definition
+
+```r
+get_ml_data_and_tuned_hyperparameters <- function(Data,
+                                                  studyid_metadata,
+                                                  Impute = FALSE,
+                                                  Round = FALSE,
+                                                  reps,
+                                                  holdback,
+                                                  Undersample = FALSE,
+                                                  hyperparameter_tuning = FALSE,
+                                                  error_correction_method = NULL) {
+  # Function implementation
+}
+```
+
+### Example Usage
+
+```r
+result <- get_ml_data_and_tuned_hyperparameters(Data = scores_df,
+                                                studyid_metadata = metadata_df,
+                                                Impute = TRUE,
+                                                Round = TRUE,
+                                                reps = 10,
+                                                holdback = 0.25,
+                                                Undersample = TRUE,
+                                                hyperparameter_tuning = TRUE,
+                                                error_correction_method = "Flip")
+
+# Access the final data and best mtry hyperparameter
+rfData <- result$rfData
+best_mtry <- result$best.m
+```
+
+## Parameters
+
+- **Data** (*data frame*):
+  Input data containing the scores. This will typically be a data frame named `scores_df`.
+
+- **studyid_metadata** (*data frame*):
+  A data frame containing metadata, typically including the `STUDYID` column, which is used for joining with the `Data`.
+
+- **Impute** (*logical*):
+  If `TRUE`, missing values in the dataset will be imputed using random forest imputation.
+
+- **Round** (*logical*):
+  If `TRUE`, specific columns will be rounded according to the rules described in the function.
+
+- **reps** (*numeric*):
+  The number of repetitions for cross-validation. A value of 0 skips repetition.
+
+- **holdback** (*numeric*):
+  The fraction of data to hold back for testing. A value of 1 means leave-one-out cross-validation.
+
+- **Undersample** (*logical*):
+  If `TRUE`, the training data will be undersampled to balance the target classes.
+
+- **hyperparameter_tuning** (*logical*):
+  If `TRUE`, hyperparameter tuning will be performed using cross-validation.
+
+- **error_correction_method** (*character*):
+  Specifies the error correction method to use. Can be `"Flip"` or `"Prune"`. Defaults to `NULL`, which means no correction.
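The metadata join and target encoding that the workflow section describes can be sketched with base R (the two toy frames are invented; the package itself uses dplyr joins):

```r
# Sketch of joining metadata by STUDYID and encoding Liver = 1, not_Liver = 0
Data <- data.frame(STUDYID = c("S1", "S2"), score = c(0.4, 2.1))
studyid_metadata <- data.frame(STUDYID = c("S1", "S2"),
                               Target_Organ = c("Liver", "not_Liver"))

rfData <- merge(Data, studyid_metadata, by = "STUDYID")
rfData$Target_Organ <- ifelse(rfData$Target_Organ == "Liver", 1, 0)
rfData
```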
+
+## Returns
+
+- **A list containing**:
+  - **rfData**:
+    The final prepared data after preprocessing, imputation, and any error correction methods.
+  - **best.m**:
+    The best `mtry` hyperparameter for the random forest model (determined through tuning or default).
+
+## Function Workflow
+
+### Data Merging
+- The function first joins the metadata (`studyid_metadata`) with the input data (`Data`) based on the `STUDYID` column.
+
+### Target Variable Encoding
+- The target variable (`Target_Organ`) is encoded such that:
+  - `'Liver'` is encoded as `1`.
+  - `'not_Liver'` is encoded as `0`.
+- This encoding facilitates the modeling process.
+
+### Missing Value Imputation
+- If `Impute` is `TRUE`, missing values are imputed using the `randomForest::rfImpute` function.
+
+### Rounding of Specific Columns
+- If `Round` is `TRUE`:
+  - Columns related to averages or liver-related data are rounded down using `floor()`.
+  - Other columns (e.g., `"MI"` columns) are rounded up using `ceiling()`.
+
+### Data Splitting
+- The data is split into training and testing sets:
+  - A fraction of the data (`holdback`) is held back for testing.
+  - For each repetition (`reps`), the data is split again.
+- The training set is optionally undersampled to balance the target classes.
+
+### Hyperparameter Tuning
+- If `hyperparameter_tuning` is `TRUE`:
+  - The function performs hyperparameter tuning for the random forest model using cross-validation with `trainControl` from the **caret** package.
+  - The `mtry` parameter is tuned, which controls the number of variables randomly sampled as candidates at each split.
+
+### Model Training
+- A random forest model is trained on the prepared data using the **randomForest** package.
+- The `best.m` hyperparameter is selected based on the tuning or set to a default value.
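+
+The tuning step described above can be sketched with **caret** as follows (a hedged sketch, not the function's exact code; `rfData`, the fold count, and the `mtry` grid are assumptions):
+
+```r
+library(caret)
+
+# 5-fold cross-validation over a small grid of mtry values
+ctrl <- trainControl(method = "cv", number = 5)
+fit <- train(as.factor(Target_Organ) ~ ., data = rfData,
+             method = "rf",
+             tuneGrid = expand.grid(mtry = c(2, 4, 8)),
+             trControl = ctrl)
+best.m <- fit$bestTune$mtry
+```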
+
+### Error Correction
+- If `error_correction_method` is specified, the function corrects the predictions based on the chosen method:
+  - `"Flip"`: Flips the target class if certain conditions are met.
+  - `"Prune"`: Removes instances that are misclassified.
+  - `"None"`: No error correction is applied.
+
+### Final Data Return
+- The processed data (`rfData`) and the best `mtry` hyperparameter (`best.m`) are returned.
diff --git a/vignettes/get_rf_model_with_cv.Rmd b/vignettes/get_rf_model_with_cv.Rmd
new file mode 100644
index 0000000..da5fc17
--- /dev/null
+++ b/vignettes/get_rf_model_with_cv.Rmd
@@ -0,0 +1,91 @@
+---
+title: "Documentation: get_rf_model_with_cv"
+author: "Your Name"
+date: "`r Sys.Date()`"
+output: html_document
+---
+
+# Introduction
+
+The `get_rf_model_with_cv` function implements a random forest-based modeling pipeline with cross-validation to assess model performance. It includes optional undersampling for handling imbalanced data and provides detailed metrics for evaluating model accuracy.
+
+# Function Overview
+
+```r
+get_rf_model_with_cv <- function(Data,
+                                 Undersample = FALSE,
+                                 best.m = NULL,  # any numeric value, or obtained from tuning
+                                 testReps,       # must be at least 2
+                                 Type) {
+  ...
+}
+```
+
+# Purpose
+
+This function:
+
+- Builds a random forest model using the `randomForest` package.
+- Performs cross-validation to evaluate model metrics.
+- Optionally applies undersampling to balance datasets.
+- Returns aggregated performance metrics.
+
+# Parameters
+
+| Parameter     | Type         | Description                                                                   |
+|---------------|--------------|-------------------------------------------------------------------------------|
+| `Data`        | Data Frame   | Input dataset. Must include a `Target_Organ` column as the response variable.  |
+| `Undersample` | Logical      | If `TRUE`, balances the dataset by undersampling the majority class.           |
+| `best.m`      | Numeric/NULL | Number of predictors sampled at each split. If `NULL`, the default is used.    |
+| `testReps`    | Integer      | Number of cross-validation folds (must be >= 2).                               |
+| `Type`        | Numeric      | Type of importance metric (`1` for Mean Decrease Accuracy, `2` for Gini).      |
+
+# Outputs
+
+The function returns a list containing:
+
+1. `performance_metrics`: Aggregated performance metrics including sensitivity, specificity, and accuracy.
+2. `raw_results`: Raw sensitivity, specificity, and accuracy values for each cross-validation fold.
+
+# Cross-Validation Workflow
+
+## Data Preparation
+
+- Splits data into training and testing subsets based on the specified `testReps`.
+- Optionally applies undersampling to balance the training set.
+
+## Model Training
+
+- Trains a random forest model using the `randomForest` package.
+
+## Prediction and Metrics Calculation
+
+- Predicts probabilities on the test set.
+- Computes metrics (sensitivity, specificity, accuracy, etc.) using the `caret` package.
+
+## Performance Summary
+
+- Aggregates performance metrics across cross-validation folds.
+
+# Example Usage
+
+```r
+# Load necessary libraries
+library(randomForest)
+library(caret)
+
+# Example dataset built from `iris` (illustrative only; replace with real data)
+Data <- iris[, -5]
+Data$Target_Organ <- ifelse(iris$Species == "setosa", 1, 0)
+
+# Run the function
+results <- get_rf_model_with_cv(Data = Data,
+                                Undersample = TRUE,
+                                best.m = 2,
+                                testReps = 5,
+                                Type = 2)
+
+# Print results
+print(results$performance_metrics)
+```
+
+# Conclusion
+
+The `get_rf_model_with_cv` function is a powerful tool for evaluating random forest models with cross-validation, especially for datasets with class imbalance. Adjust parameters such as `Undersample` and `best.m` to optimize performance for your specific dataset.
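+
+For readers curious how the `Undersample` option could work, here is a minimal base-R sketch of downsampling the majority class (a hypothetical helper, not the package's implementation):
+
+```r
+undersample <- function(df, target = "Target_Organ") {
+  pos <- df[df[[target]] == 1, ]
+  neg <- df[df[[target]] == 0, ]
+  n <- min(nrow(pos), nrow(neg))
+  # Sample each class down to the size of the smaller one
+  rbind(pos[sample(nrow(pos), n), ],
+        neg[sample(nrow(neg), n), ])
+}
+```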
diff --git a/vignettes/get_zone_exclusioned_rf_model_with_cv.Rmd b/vignettes/get_zone_exclusioned_rf_model_with_cv.Rmd
new file mode 100644
index 0000000..03c18b1
--- /dev/null
+++ b/vignettes/get_zone_exclusioned_rf_model_with_cv.Rmd
@@ -0,0 +1,107 @@
+---
+title: "Random Forest Model with Cross-validation and Exclusion"
+author: "Your Name"
+output: html_document
+---
+
+## Introduction
+
+The `get_zone_exclusioned_rf_model_with_cv` function implements a Random Forest classification model with cross-validation. It provides tools for evaluating the model's performance, including sensitivity, specificity, accuracy, and other metrics. The function allows users to handle indeterminate predictions and includes an option for undersampling the data, which can be particularly useful when dealing with imbalanced datasets.
+
+This document explains how to use the function and describes its inputs, outputs, and the key steps involved in model training and evaluation.
+
+## Function Purpose
+
+The main goal of this function is to train a Random Forest model and evaluate it using cross-validation. The function:
+
+- Performs cross-validation across a specified number of repetitions (`testReps`).
+- Allows for undersampling of the dataset to address class imbalance if required.
+- Handles indeterminate predictions by setting them to `NA`.
+- Tracks performance metrics like sensitivity, specificity, positive predictive value (PPV), and accuracy for each repetition.
+- Provides an aggregated summary of performance metrics across all repetitions.
+
+Additionally, the function provides an option to adjust the feature importance calculation, using either the Gini index or Mean Decrease Accuracy.
+
+## Parameters
+
+The function accepts the following parameters:
+
+- **Data** (`Data`): A data frame containing the features and the target variable (`Target_Organ`) to train the model on.
+
+- **Undersample** (`Undersample`): A logical value indicating whether to perform undersampling on the data to balance the class distribution. If set to `TRUE`, the function will undersample the negative class to match the number of positive class instances.
+
+- **Best Model Parameter** (`best.m`): A numeric value indicating the best number of variables (`mtry`) to use at each split in the Random Forest model. This value can be provided manually or determined through optimization.
+
+- **Test Repetitions** (`testReps`): The number of times to repeat the cross-validation process. This value must be at least 2, as the function relies on multiple test sets to assess model performance.
+
+- **Indeterminate Prediction Thresholds** (`indeterminateUpper`, `indeterminateLower`): These parameters define the upper and lower bounds for indeterminate predictions. If a model's predicted probability falls between these thresholds, the prediction is considered indeterminate and set to `NA`.
+
+- **Feature Importance Type** (`Type`): An integer indicating the type of feature importance to use in the Random Forest model. Typically, this will be either `1` for Mean Decrease Accuracy or `2` for Mean Decrease Gini.
+
+## Model Workflow
+
+1. **Data Preparation**:
+   - The input data frame (`Data`) is processed to ensure that it is formatted correctly for model training. The column names are simplified to numeric identifiers for easier manipulation.
+
+2. **Cross-validation**:
+   - The function performs cross-validation, repeating the training and testing process for the specified number of repetitions (`testReps`). In each repetition:
+     - The dataset is split into a training set and a test set, with each iteration using different random samples.
+     - The Random Forest model is trained on the training set, and predictions are made on the test set.
+
+3. **Undersampling (Optional)**:
+   - If `Undersample` is set to `TRUE`, the function balances the dataset by undersampling the negative class. The positive class is left unchanged, and the negative class is reduced to match the size of the positive class.
+
+4. **Prediction and Evaluation**:
+   - After training the model, predictions are made on the test data. The predicted probabilities are stored and later used to calculate performance metrics.
+   - Indeterminate predictions are identified based on the upper and lower thresholds (`indeterminateUpper` and `indeterminateLower`). These predictions are marked as `NA` and not included in performance calculations.
+
+5. **Performance Metrics**:
+   - For each repetition, the function calculates various performance metrics, including:
+     - **Sensitivity**: The proportion of true positives correctly identified by the model.
+     - **Specificity**: The proportion of true negatives correctly identified by the model.
+     - **Accuracy**: The overall accuracy of the model in predicting both classes.
+     - **PPV (Positive Predictive Value)**: The proportion of positive predictions that are correct.
+     - **NPV (Negative Predictive Value)**: The proportion of negative predictions that are correct.
+     - **Prevalence**: The proportion of positive cases in the dataset.
+   - These metrics are computed using the `caret` package's confusion matrix function.
+
+6. **Aggregated Results**:
+   - After completing all test repetitions, the function calculates the mean of each performance metric across all repetitions to provide an aggregated performance summary.
+   - The results include both individual metrics for each repetition and the overall performance summary.
+
+## Outputs
+
+The function returns a list with two components:
+
+1. **performance_metrics**: A vector containing the aggregated performance metrics (mean sensitivity, specificity, accuracy, etc.) calculated across all test repetitions.
+
+2. **raw_results**: A list containing the raw performance metrics for each repetition, including:
+   - `sensitivity`: A vector of sensitivity values for each test repetition.
+   - `specificity`: A vector of specificity values for each test repetition.
+   - `accuracy`: A vector of accuracy values for each test repetition.
+
+These outputs can be used to evaluate the model's performance and further analyze the results.
+
+## Example Usage
+
+Below is an example of how to use the function:
+
+```r
+# Example dataset (replace with actual data)
+Data <- your_data_frame
+
+# Run the model with cross-validation and undersampling
+results <- get_zone_exclusioned_rf_model_with_cv(Data = Data,
+                                                 Undersample = TRUE,
+                                                 best.m = 5,
+                                                 testReps = 10,
+                                                 indeterminateUpper = 0.8,
+                                                 indeterminateLower = 0.2,
+                                                 Type = 1)
+
+# View the aggregated performance metrics
+print(results$performance_metrics)
+
+# Access raw results for further analysis
+print(results$raw_results)
+```
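+
+The indeterminate-zone rule described above can be illustrated with a small, self-contained sketch (toy probabilities and thresholds; not the package's exact code):
+
+```r
+p     <- c(0.05, 0.35, 0.55, 0.92)  # predicted probabilities (toy values)
+truth <- c(0, 0, 1, 1)
+indeterminateLower <- 0.25
+indeterminateUpper <- 0.75
+
+# Probabilities inside the zone become NA and are excluded from the metrics
+pred <- ifelse(p > indeterminateUpper, 1,
+               ifelse(p < indeterminateLower, 0, NA))
+keep <- !is.na(pred)
+table(Predicted = pred[keep], Actual = truth[keep])
+```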