Srishti #1

Open · wants to merge 48 commits into base: master

48 commits
63ce28f  Create README.md (Srishti013, Sep 15, 2022)
f19fc35  Add files via upload (Srishti013, Sep 15, 2022)
fb0569a  Create __init__.py (Srishti013, Sep 15, 2022)
423bbe6  Add files via upload (Srishti013, Sep 15, 2022)
f7a3909  Create Readme.md (Srishti013, Sep 15, 2022)
c54bbe7  Add files via upload (Srishti013, Sep 15, 2022)
50b02c7  Delete Readme.md (Srishti013, Sep 15, 2022)
fae899c  Create readme.md (Srishti013, Sep 15, 2022)
37c1ffd  Add files via upload (Srishti013, Sep 15, 2022)
b9e1903  Delete readme.md (Srishti013, Sep 15, 2022)
f6b9b21  Create funcs_analysis.py (Srishti013, Sep 15, 2022)
8ca45e5  Add files via upload (Srishti013, Sep 15, 2022)
d156a88  Create README.md (Srishti013, Sep 15, 2022)
53af2bf  Add files via upload (Srishti013, Sep 15, 2022)
a8c6230  Delete .gitignore (Srishti013, Sep 15, 2022)
cafe4ed  Delete README.md (Srishti013, Sep 15, 2022)
5d5c1cd  Delete funcs.py (Srishti013, Sep 15, 2022)
9115ba4  Delete neck_area.py (Srishti013, Sep 15, 2022)
87220ef  Delete neck_volume.py (Srishti013, Sep 15, 2022)
63ae74f  Create README.md (Srishti013, Sep 15, 2022)
39b587f  Create readme.md (Srishti013, Sep 16, 2022)
e9e2b0c  Add files via upload (Srishti013, Sep 16, 2022)
e98df63  Delete readme.md (Srishti013, Sep 16, 2022)
67e4f72  Update README.md (Srishti013, Sep 20, 2022)
2e6b084  Update README.md (Srishti013, Sep 20, 2022)
232d568  Update README.md (Srishti013, Sep 27, 2022)
76975e2  Add files via upload (Srishti013, Sep 27, 2022)
645c037  Update README.md (Srishti013, Sep 27, 2022)
9d21f84  Add files via upload (Srishti013, Sep 27, 2022)
211e59a  Update README.md (Srishti013, Sep 27, 2022)
51c19c7  Create readme.md (Srishti013, Sep 27, 2022)
8ccc105  Add files via upload (Srishti013, Sep 27, 2022)
ab1e58e  Add files via upload (Srishti013, Sep 27, 2022)
6ada4a9  Update README.md (Srishti013, Sep 27, 2022)
5690f8e  Create Readme.md (Srishti013, Sep 27, 2022)
b266d84  Update Readme.md (Srishti013, Sep 27, 2022)
25ff26c  Add files via upload (Srishti013, Sep 27, 2022)
0888ec4  Update Readme.md (Srishti013, Sep 27, 2022)
00faa36  Add files via upload (Srishti013, Sep 28, 2022)
49bdba8  Delete Regression.ipynb (Srishti013, Sep 30, 2022)
b86b2fc  Update Readme.md (Srishti013, Sep 30, 2022)
2b49731  Delete Data_Cleaning,_Preparation_and_Classification.ipynb (Srishti013, Sep 30, 2022)
4035fd0  Add files via upload (Srishti013, Sep 30, 2022)
36b1fea  Add files via upload (Srishti013, Sep 30, 2022)
27f71d0  Add files via upload (Srishti013, Sep 30, 2022)
0d68a91  Delete Data_Combing_and_Visualization.ipynb (Srishti013, Sep 30, 2022)
3f405f8  Update Readme.md (Srishti013, Sep 30, 2022)
ac4d338  Uploaded current code for automatic patient data extractor (james-mnld, Oct 14, 2022)
13 changes: 0 additions & 13 deletions .gitignore

This file was deleted.

Binary file added Data_Analysis_and_Machine_Learning/FMT.pptx
Binary file not shown.
32 changes: 32 additions & 0 deletions Data_Analysis_and_Machine_Learning/Readme.md
@@ -0,0 +1,32 @@
## Data_Analysis_and_Machine_Learning

* Through this code the collected data was:
  * cleaned
  * analysed
  * preprocessed
  * used for machine learning

* To run these scripts you should have access to [this](https://drive.google.com/drive/folders/1e7VH-aApdMa6oUCCbxbKbKHUNHiO8RR_?usp=sharing) folder.
<hr>

### [Data Combing and Visualization](https://github.com/Srishti013/HNC_project/blob/Srishti/Data_Analysis_and_Machine_Learning/data_combing_and_visualization.py)

* This script concatenates the existing datasets into a single data file.
* It can also be used to append further datasets to the [master dataset](https://github.com/Srishti013/HNC_project/blob/Srishti/Datafiles/master.csv).
* The combined data was then visualized using histograms and a heatmap.
<hr>
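The combining step above can be sketched roughly as follows. The two small tables and their column names (`patient_num`, `fraction`, `volume`) are illustrative assumptions, not the project's actual schema; the real script reads the project's CSV files.

```python
import pandas as pd

# Two illustrative per-patient tables (hypothetical columns).
df_a = pd.DataFrame({'patient_num': [1, 1], 'fraction': [1, 2], 'volume': [100.0, 98.5]})
df_b = pd.DataFrame({'patient_num': [2, 2], 'fraction': [1, 2], 'volume': [95.0, 94.2]})

# Concatenate into a single master dataset, mirroring how
# data_combing_and_visualization.py builds master.csv.
master = pd.concat([df_a, df_b], ignore_index=True)
print(master.shape)  # (4, 3)
```

From here, `master.hist()` or `sns.heatmap(master.corr())` produces the kind of histogram and heatmap visualizations the script generates.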

### [Classification](https://github.com/Srishti013/HNC_project/blob/Srishti/Data_Analysis_and_Machine_Learning/data_cleaning%2C_preparation_and_classification.py)

* This script preprocesses the data so it can be fed into a classification model.
* 15 classification models were then used to predict whether a patient would require replanning, based on data from the patient's first 12 fractions.
* The best accuracy score, 0.778, was achieved by the extra trees classifier.
<hr>

### [Regression](https://github.com/Srishti013/HNC_project/blob/Srishti/Data_Analysis_and_Machine_Learning/regression.py)
* This script involves preprocessing the data for the regression models.
* Three regression models were tried, but the results were poor: we currently have data for only 31 replanned patients, and it shows no strong trend.
* The script can be reused once more data or stronger influential factors are available.
<hr>
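As a rough illustration of the regression step, here is a minimal sketch on synthetic data (31 samples, matching the 31 replanned patients). The features, coefficients, and choice of `LinearRegression` are assumptions for demonstration, not the script's actual setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 31 "patients" with 4 features and a noisy target.
X = rng.normal(size=(31, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.5, size=31)

# 70/30 split, fit, and score one regression model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out split
```

With so few samples the held-out R^2 is very noisy, which is consistent with the weak results reported above.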

* [This](https://github.com/Srishti013/HNC_project/blob/Srishti/Data_Analysis_and_Machine_Learning/FMT.pptx) presentation gives more background on the project and the work done.
@@ -0,0 +1,178 @@
# -*- coding: utf-8 -*-
"""Data_Cleaning, Preparation and Classification.ipynb

Automatically generated by Colaboratory.

Original file is located at
https://colab.research.google.com/drive/1cXj1MyACqzmYnsRYabn4_bTWYPtzc3vM
"""

# Requires catboost: pip install catboost
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from catboost import CatBoostClassifier
from matplotlib import rcParams
from scipy.stats import sem
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LinearRegression, SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

master = pd.read_csv('Master_file')

len(master.columns.tolist())

# Delete columns with more than 30% null values
# (iterate over a copy of the column list, since we delete while looping)
for col in master.columns.tolist():
    if master[col].isna().sum() >= 0.3 * len(master):
        del master[col]

cols = master.columns.tolist()
len(cols)

# Drop numerical features with fewer than 3 unique values
for col in master.columns.tolist():
    if master[col].nunique() < 3 and master[col].dtype == 'float64':
        print(f"{col} {master[col].dtype}: {master[col].unique()}\n\n")
        del master[col]

cols = master.columns.tolist()
len(cols)

# List of columns having categorical data
obj_df = master.select_dtypes(include=['object']).copy()
obj_df.head()

# List of columns having numerical data
num_df = master.select_dtypes(include=['int64','float64']).copy()
num_df.head()

# Fill NaNs in cancer_category_id with the mode
print(master['cancer_category_id'].value_counts())
master['cancer_category_id'].fillna(master['cancer_category_id'].mode()[0], inplace=True)
print(master['cancer_category_id'].isna().sum())

# Fill missing numerical variables with the median
for col in num_df.columns:
    if master[col].isna().sum() > 0:
        print(f"{col} {master[col].dtype}: {master[col].isna().sum()}")
        master[col].fillna(master[col].median(), inplace=True)
        print(f"{col} {master[col].dtype}: {master[col].isna().sum()}")

# Fill missing categorical variables with the mode
for col in obj_df.columns:
    if master[col].isna().sum() > 0:
        print(f"{col} {master[col].dtype}: {master[col].isna().sum()}")
        master[col].fillna(master[col].mode()[0], inplace=True)
        print(f"{col} {master[col].dtype}: {master[col].isna().sum()}")

# Encode categorical columns as integer codes
for col in obj_df:
    master[col] = master[col].astype('category')
    master[col] = master[col].cat.codes

master.head()

cols = master.columns.tolist()
list(enumerate(cols))

# Drop the per-fraction slope features for fractions 13-26
slope_prefixes = [
    'xmin-slope_Body', 'xmed-slope_Body', 'xave-slope_Body',
    'volume-slope_body_Body', 'volume-slope_outer-PTV_Body',
    'volume-ratio-slope_inner-PTV_Body', 'volume-ratio-slope_outer-PTV_Body',
]
for prefix in slope_prefixes:
    for i in range(13, 27):
        del master[f'{prefix}-{i}']

cols = master.columns.tolist()
list(enumerate(cols))

master.shape

master.head()

# Input
X = master.copy()
del X['replanned_or_not']
del X['patient_num']
X.head()

# Output Labels
y = pd.DataFrame(master['replanned_or_not'])
y.head()

# Data split: 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train.shape, y_train.shape

X_test.shape, y_test.shape

names = ["Nearest_Neighbors", "Cat_Boost", "Linear_SVM", "Polynomial_SVM", "RBF_SVM",
         "Gaussian_Process", "Gradient_Boosting", "Decision_Tree", "Extra_Trees",
         "Random_Forest", "Neural_Net", "AdaBoost", "Naive_Bayes", "QDA", "SGD"]

classifiers = [
    KNeighborsClassifier(3),
    CatBoostClassifier(iterations=5, learning_rate=0.1),
    SVC(kernel="linear", C=0.025),
    SVC(kernel="poly", degree=3, C=0.025),
    SVC(kernel="rbf", C=1, gamma=2),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0),
    DecisionTreeClassifier(max_depth=5),
    ExtraTreesClassifier(n_estimators=10, min_samples_split=2),
    RandomForestClassifier(max_depth=5, n_estimators=100),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(n_estimators=100),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
    SGDClassifier(loss="hinge", penalty="l2"),
]

# Train each classifier and record its accuracy on the test split
scores = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train.values.ravel())
    score = clf.score(X_test, y_test)
    scores.append(score)

scores

df = pd.DataFrame()
df['name'] = names
df['score'] = scores
df

cm = sns.light_palette("green", as_cmap=True)
s = df.style.background_gradient(cmap=cm)
s

sns.set(style="whitegrid")
ax = sns.barplot(y="name", x="score", data=df)