Lung cancer remains the leading cause of cancer-related mortality worldwide. Unfortunately, only 16% of cases are diagnosed at an early, localized stage, where patients have a five-year survival rate exceeding 50%. When lung cancer is identified at more advanced stages, the survival rate plummets to just 5%.
Given this stark difference, early diagnosis is critical for improving patient outcomes. Non-invasive imaging methods, such as computed tomography (CT), have proven effective in providing crucial information regarding tumor status. This opens opportunities for developing computer-aided diagnosis (CAD) systems capable of assessing the malignancy risk of lung nodules and supporting clinical decision-making.
The goal of this project is to create a machine learning-based solution for classifying lung nodules as benign or malignant using CT images from the LIDC-IDRI dataset.
At our professor's request, this project was developed in a Notebook. If you want to try it out yourself, use either an Anaconda distribution or third-party software that lets you inspect and execute it.
For more information about the virtual environment used in Anaconda, see the DEPENDENCIES.md file.
The project involves several key phases:
- Data Preprocessing: Cleaning and preparing the CT scan data to ensure its quality and consistency for further analysis.
- Feature Engineering: Leveraging radiomics to extract meaningful features from the scans.
- Model Development and Evaluation: Training and fine-tuning machine learning models to accurately classify lung nodules based on their malignancy status. This phase also assesses model performance using key metrics such as balanced accuracy and AUC, and validates results through robust methods such as k-fold cross-validation.
- Statistical Inference: Conducting a statistical analysis to determine performance differences between the models and identify which one delivers the best results for this classification task.
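To make the evaluation metrics concrete, here is a minimal, dependency-free sketch of balanced accuracy, AUC, and a k-fold split. It is not the notebook's code: the function names, the toy labels, and the choice of a stratified split are assumptions made for illustration. In practice, a library such as scikit-learn offers equivalents (`balanced_accuracy_score`, `roc_auc_score`, `StratifiedKFold`).

```python
import random
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_kfold(y, k, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)  # deal indices round-robin into folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Toy, made-up labels: 0 = benign, 1 = malignant (imbalanced on purpose).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
scores = [0.1, 0.2, 0.3, 0.2, 0.4, 0.7, 0.9, 0.8, 0.35, 0.3]

print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
for train_idx, test_idx in stratified_kfold(y_true, k=2):
    print("train:", train_idx, "test:", test_idx)
```

On this toy data, plain accuracy would be 0.7, while balanced accuracy is about 0.667 because half of the malignant cases are missed. That sensitivity to the minority class is why balanced accuracy is a reasonable metric when benign and malignant nodules are unevenly represented.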
The ultimate objective of this automated classification system is to aid in clinical decision-making, offering a supplementary screening tool that reduces the workload on radiologists while improving early detection rates for lung cancer.
If you're interested in inspecting and executing this project yourself, you'll need access to all the datasets we've created.
Since GitHub has file size limits, we've made them all available on Google Drive, which you can access here.
Here's a quick overview of how nodule malignancy in the dataset is distributed across the five malignancy levels.
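A distribution like this can be tallied in a few lines. The sketch below uses invented ratings rather than real LIDC-IDRI data, and the rule for collapsing the five levels into benign and malignant (1-2 benign, 4-5 malignant, 3 set aside as indeterminate) is an assumption for illustration, not necessarily the rule used in this project.

```python
from collections import Counter

# Hypothetical per-nodule malignancy ratings on the 1-5 scale (not real data).
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 1, 3, 5]


def to_binary_label(rating):
    """Map a 1-5 rating to 0 (benign), 1 (malignant), or None (indeterminate).

    Assumed rule: 1-2 -> benign, 4-5 -> malignant, 3 -> set aside as ambiguous.
    """
    if rating <= 2:
        return 0
    if rating >= 4:
        return 1
    return None


# Count nodules at each of the five levels.
level_counts = Counter(ratings)

# Collapse to binary labels and count the two classes, skipping indeterminate cases.
binary_labels = [to_binary_label(r) for r in ratings]
class_counts = Counter(label for label in binary_labels if label is not None)

for level in range(1, 6):
    print(f"malignancy {level}: {level_counts[level]} nodules")
print(f"benign: {class_counts[0]}, malignant: {class_counts[1]}")
```

How the middle level is handled changes both the class balance and the size of the training set, so it is worth checking the notebook for the rule actually applied.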
Here are some of the results obtained from various selected machine learning algorithms, which we found to be the most interesting based on their balanced accuracy scores.
Performance Evaluation

| Algorithm | Metrics |
|---|---|
| SVM | |
| Random Forest | |
| XGBoost | |
| Voting Classifier | |
To better illustrate the performance differences between the models, let's examine their critical difference diagram.
In this diagram, XGBoost and the Voting Classifier share the same average rank (2.2), suggesting that they performed similarly and may be the best suited to this classification problem.
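For readers unfamiliar with critical difference diagrams, the ranks they show are usually average ranks: each model is ranked within every fold (1 = best), and the ranks are averaged. The sketch below illustrates that computation on invented scores for four placeholder models (A to D), chosen so that two of them end up sharing an average rank. It also computes a Nemenyi critical difference, which is the conventional companion to such diagrams. Whether the notebook uses the Nemenyi test is an assumption here, and `average_ranks` and `nemenyi_critical_difference` are hypothetical helper names.

```python
import math


def average_ranks(scores_by_model):
    """Rank models within each fold (1 = best; ties share the mean rank), then average."""
    models = list(scores_by_model)
    n_folds = len(next(iter(scores_by_model.values())))
    rank_sums = {m: 0.0 for m in models}
    for fold in range(n_folds):
        fold_scores = {m: scores_by_model[m][fold] for m in models}
        for m in models:
            better = sum(s > fold_scores[m] for s in fold_scores.values())
            tied = sum(s == fold_scores[m] for s in fold_scores.values())
            # Tied models occupy ranks better+1 .. better+tied; use their mean.
            rank_sums[m] += better + (tied + 1) / 2
    return {m: rank_sums[m] / n_folds for m in models}


def nemenyi_critical_difference(n_models, n_folds, q_alpha):
    """Minimum gap in average rank for two models to differ significantly."""
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6 * n_folds))


# Invented balanced-accuracy scores over 5 folds for placeholder models.
scores_by_model = {
    "A": [0.90, 0.87, 0.84, 0.85, 0.88],
    "B": [0.88, 0.89, 0.86, 0.87, 0.84],
    "C": [0.86, 0.85, 0.88, 0.83, 0.90],
    "D": [0.84, 0.83, 0.82, 0.89, 0.86],
}

ranks = average_ranks(scores_by_model)
# q_alpha = 2.569 is the tabulated Nemenyi value for 4 models at alpha = 0.05.
cd = nemenyi_critical_difference(n_models=4, n_folds=5, q_alpha=2.569)

for model, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{model}: average rank {rank:.1f}")
print(f"critical difference: {cd:.3f}")
```

Two models whose average ranks differ by less than the critical difference cannot be told apart statistically, which is why a shared rank is read as "performed similarly" rather than as proof that either model is best.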
- Authors → Francisco Macieira, Gonçalo Esteves and Nuno Gomes
- Course → Laboratory of AI and DS [CC3044]
- University → Faculty of Sciences, University of Porto
README.md by Gonçalo Esteves