-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreport-old.Rmd
170 lines (151 loc) · 7.14 KB
/
report-old.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
---
title: "Assignment1-DataScience-draft"
author: "Amir George"
date: "April 14, 2016"
output: html_document
---
# NOTE: This is an incomplete experiment that mainly used the `caret` package but encountered some problems regarding efficiency, please refere to `report.Rmd` for the final and complete version that uses the `RWeka` package.
## Attaching the required libraries
```{r message=FALSE}
library(plyr)
library(dplyr)
library(knitr)
library(tidyr)
library(caret)
library(RWeka)
library(e1071)
library(randomForest)
library(kernlab)
library(klaR)
library(rpart)
library(ipred)
library(fastAdaboost)
```
# Helper functions
Here we define helper functions that will be repeatedly used throughout the assignment.
## `getEvaluation` function
The function `getEvaluation` extracts the required classification evaluation measures from a given confusion matrix.
```{r}
getEvaluation <- function(confMat) {
tp <- confMat$table[1,1]
fn <- confMat$table[1, 2]
fp <- confMat$table[2, 1]
tn <- confMat$table[2, 2]
Accurary <- (tp + tn) / (tp + fn + fp + tn)
Precision <- tp / (tp + fp)
Recall <- tp / (tp + fn)
F1 <- (2 * Precision * Recall) / (Precision + Recall)
res <- cbind(Accurary,Precision,Recall,F1)
return(res)
}
```
# Reading the sonar dataset
We read the whole dataset, and give the target output column a clear name.
```{r cache=TRUE}
#sonarDf <- read.csv('C:/Users/Amir George/Rworkspace/Assignment1-DataScience/csen1061-assignment-modeling/datasets/sonar.data', header = FALSE)
sonarDf <- read.csv('datasets/sonar.data', header = FALSE)
names(sonarDf)[names(sonarDf)=="V61"] <- "TargetOutput"
```
# Section 2 in assignment description (with a C4.5 decision tree)
In this section we construct a C4.5 decision tree on the sonar dataset, using the entire dataset as the learning set, and showcase the different classification evaluation measures. We consider the metal class as the positive class.
```{r cache=TRUE}
#sonarDTall <- J48(TargetOutput ~ ., data = sonarDf)
sonarDTall <- train(TargetOutput ~ ., data=sonarDf, method="J48")
#sonarDTall %>% getEvaluation("model") %>% kable
predictions <- predict(sonarDTall, sonarDf[,1:60])
confusionMatrix(predictions, sonarDf[,61]) %>% getEvaluation %>% kable
```
We should note that the above measures are not reliable since training and testing were done on the same dataset, which should never be done due to overfitting. To avoid this problem, we will use stratified 10-fold cross-validation.
```{r cache=TRUE}
tenFoldCVControl <- trainControl(method = "cv", number = 10)
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="J48",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
# Section 2 in assignment description (with a Cart decision tree)
Training and testing on the same whole dataset:
```{r cache=TRUE}
sonarDTall <- train(TargetOutput ~ ., data=sonarDf, method="rpart")
predictions <- predict(sonarDTall, sonarDf[,1:60])
confusionMatrix(predictions, sonarDf[,61]) %>% getEvaluation %>% kable
```
Using stratified 10-fold cross-validation:
```{r cache=TRUE}
tenFoldCVControl <- trainControl(method = "cv", number = 10)
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="rpart",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
## Section 3 in assignment description
In this section we will also train and test using 10-fold CV on the sonar dataset but with other classification algorithms.
### Random Forest
```{r cache=TRUE}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="rf",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
###Support Vector Machines (SVM)
Out of all the SVM methods offered by the caret package, we chose the `svmRadial` method.
```{r cache=TRUE}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="svmRadial",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
### Naive Bayes
```{r cache=TRUE, warning=FALSE}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="nb",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
### Neural Networks
```{r cache=TRUE, message=FALSE, results = "hide"}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="nnet",trControl=tenFoldCVControl)
```
```{r cache=TRUE}
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
### Bagging
Now we will implement using the ensemble learning method, using Cart as the base classifier, again with stratified 10-fold CV.
```{r cache=TRUE, warning=FALSE}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="treebag",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
### Boosting
Now we will implement using Adaboost, again with stratified 10-fold CV.
```{r cache=TRUE}
sonarDTtenFoldCV <- train(TargetOutput ~ ., data=sonarDf, method="adaboost",trControl=tenFoldCVControl)
sonarDTtenFoldCV %>% confusionMatrix %>% getEvaluation %>% kable
```
## Section 4 in assignment description
Now we proceed to import other datasets, to test the multiple algorithms on them.
### Importing Hepatitis dataset
Here the class attribute is the first attribute, as stated in `hepatitis.names` on the UCI repository.
```{r cache=TRUE}
#hepatitisDf <- read.csv('C:/Users/Amir George/Rworkspace/Assignment1-DataScience/csen1061-assignment-modeling/datasets/hepatitis.data', header = FALSE)
hepatitisDf <- read.csv('datasets/hepatitis.data', header = FALSE)
names(hepatitisDf)[names(hepatitisDf)=="V1"] <- "TargetOutput"
hepatitisDf <- hepatitisDf %>% mutate(TargetOutput = as.factor(TargetOutput))
```
### Importing Spect dataset
Here the class attribute is the first attribute, as stated in `SPECT.names` on the UCI repository.
```{r cache=TRUE}
#spectDf <- read.csv('C:/Users/Amir George/Rworkspace/Assignment1-DataScience/csen1061-assignment-modeling/datasets/SPECT.data', header = FALSE)
spectDf <- read.csv('datasets/SPECT.data', header = FALSE)
names(spectDf)[names(spectDf)=="V1"] <- "TargetOutput"
spectDf <- spectDf %>% mutate(TargetOutput = as.factor(TargetOutput))
```
### Importing Pima-indians dataset
Here the class attribute is the ninth attribute, as stated in `pima-indians-diabetes.names` on the UCI repository.
```{r cache=TRUE}
#pimaDf <- read.csv('C:/Users/Amir George/Rworkspace/Assignment1-DataScience/csen1061-assignment-modeling/datasets/pima-indians-diabetes.data', header = FALSE)
pimaDf <- read.csv('datasets/pima-indians-diabetes.data', header = FALSE)
names(pimaDf)[names(pimaDf)=="V9"] <- "TargetOutput"
pimaDf <- pimaDf %>% mutate(TargetOutput = as.factor(TargetOutput))
```
### Results of 10 times 10 CV for each dataset & algorithm pair
The function `doCV10by10` takes a dataframe and an algorithm as input and performs 10 by 10 CV, then outputs our required evaluation metrics.
```{r}
doCV10by10 <- function(df,algorithm) {
tenByTenCVControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
model <- train(TargetOutput ~ ., data=df, method=algorithm,trControl=tenByTenCVControl)
model %>% confusionMatrix %>% getEvaluation %>% kable
}
```
```{r}
#doCV10by10(hepatitisDf,"rf")
```