---
title: "Titanic Survival"
author: "Travis Barton"
date: "7/24/2018"
output: html_document
---
```{r setup and Libraries, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)    # read_csv
library(stringr)  # title extraction from passenger names
library(e1071)    # svm
library(caret)
train <- read_csv("~/Desktop/Projects/Titanic/train.csv")
test <- read_csv("~/Desktop/Projects/Titanic/test.csv")
```
This guide should serve as a first-time data analyst's stepping stone into the basic principles of statistics. Our goal is to understand the data first, and only then analyze and predict with it.
# 1. Data Cleaning
### Pclass
This represents the passenger class of the ship, which should correlate directly with socioeconomic status. Let's take a look at how the classes fared in terms of survival (pun intended).
```{r pclass graph, echo=FALSE}
barplot(table(train$Survived, train$Pclass), col = c("dark blue", "orange"))
legend("topleft", fill=c("dark blue", "orange"), legend=c("Died", "Survived"))
```
So it seems that a better passenger class corresponds with better survival. This means we should keep the variable for our final model.
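To put a number on that impression, here is a quick check (one I'm adding; it is not part of the original analysis) of the survival rate within each class, using column-wise proportions of the same table:
```{r pclass proportions}
# survival rate within each passenger class
prop.table(table(train$Survived, train$Pclass), margin = 2)
```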
### Name
'Name' will likely be the least helpful variable of the bunch, but that's not to say there is nothing to be learned from it. The names themselves are not all that informative, but the titles that go along with them will be. Their usefulness will show up mostly in the 'Age' stage, but for now, let's extract the titles.
```{r extract the titles, echo=FALSE}
# pull out the title: the text between the comma and the first period in each name
train$title <- str_sub(train$Name, str_locate(train$Name, ",")[ , 1] + 2, str_locate(train$Name, "\\.")[ , 1] - 1)
barplot(table(train$Survived, train$title), col = rainbow(3), las = 2)
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
While many of the bars are hard to see, Mr. seems to correlate with low survival rates, while Mrs. and Miss seem to mean higher survival rates. This will be useful to us. In fact, the remaining titles are so rare that we might as well combine them into their own category.
```{r combining rare titles, echo=FALSE}
index <- which(train$title %in% c("Master", "Miss", "Mr", "Mrs", "Rev"))
train$title[-index] = "Other"
barplot(table(train$Survived, train$title), col = rainbow(3), las = 2)
legend("topright", fill=rainbow(3), legend=c("Died", "Survived"))
```
### Sex
Let's see how sex relates to survival, and check how many values are missing.
```{r sex-examination, echo=FALSE}
barplot(table(train$Survived, train$Sex), col = rainbow(3))
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
It looks like being female on the Titanic increased your likelihood of survival, which means Sex will be a useful predictor. What is more, Sex has no missing values, so no imputation is necessary.
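Since missing values will come up repeatedly below, here is a quick overview of the NA counts in the training set (a check I find worth running up front; the Age and Cabin figures quoted later come from exactly this kind of tally):
```{r missing overview}
# NA count and fraction for every column of the training set
colSums(is.na(train))
round(colMeans(is.na(train)), 4)
```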
### Age
Age will be a troublesome variable: 19.86% of the ages are missing. With so many missing values, it would not be practical to give them all the same value (whether by mean or median imputation); doing so would throw away much of the information that can be gleaned.
This is where the title variable will come in. We can get a more accurate estimate of the ages of the passengers if we impute the age based on their title.
```{r age imputation}
# median age for each title, computed from the non-missing ages
title_medians <- tapply(train$Age, train$title, median, na.rm = TRUE)
# fill each missing age with the median age for that passenger's title
missing <- is.na(train$Age)
train$Age[missing] <- title_medians[train$title[missing]]
```
Now that we have a more accurate estimate of the ages, let's take a look at how they relate to survival.
```{r, echo=FALSE}
index <- which(train$Survived == 1)
boxplot(train$Age[index], train$Age[-index], names = c("Survived", "Died") )
```
It seems that the people who survived had a slightly lower age than those who died, but the picture is not clear. Let's try binning the ages and see if that clarifies what is happening.
```{r binning ages, include=FALSE}
# bin the ages: child (< 12), teenager (12-19), young adult (19-26),
# adult (26-59), senior (59 and up)
breaks <- c(-Inf, 12, 19, 26, 59, Inf)
train$agebin <- as.integer(cut(train$Age, breaks, right = FALSE))
```
```{r, echo=FALSE}
barplot(table(train$Survived, train$agebin), col = rainbow(3), names.arg = c("< 12", "12-19", "19-26", "26-59", "> 59"), las = 1)
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
This is still not looking very useful, however. Age may have to be dropped.
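To put numbers on that, here is the survival rate within each age bin (the same column-proportion trick as before):
```{r agebin proportions}
# survival rate within each age bin
prop.table(table(train$Survived, train$agebin), margin = 2)
```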
### SibSp
This variable represents the number of siblings and spouses a person had on board. My intuition tells me that larger families will be able to get off of the boat more easily, but let's see what the data has to say.
```{r sibsp examination}
barplot(table(train$Survived, train$SibSp), col = rainbow(5))
legend("topright", fill=rainbow(5), legend=c("Died", "Survived"))
```
My intuition seems only partly right: passengers with one or two siblings or spouses aboard fared better than lone passengers, while the largest families fared worst. Either way, this variable will give our model valuable insight.
### Parch
Parch measures the number of parents or children that were aboard with each passenger. Since SibSp surprised me by suggesting that small family groups fared best, I suspect that Parch will agree, but let's take a look at the numbers first.
```{r Parch examination}
barplot(table(train$Survived, train$Parch), col = rainbow(10))
legend("topright", fill=rainbow(10), legend=c("Survived", "Died"))
```
Again, passengers with one or two family members aboard did best, while those traveling alone or in very large groups fared worse. Whatever the reason, it should make a good predictor for our model.
### Ticket
Ticket does not seem to contain any usable information. The patterns may hide something useful, but nothing jumps out to my eye.
```{r Ticket Examination}
head(train$Ticket)
```
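If you wanted to dig further, one way to probe the tickets (just a sketch, not something pursued here) is to strip the digits and tabulate the remaining letter prefixes against survival:
```{r ticket prefixes}
# keep only the non-numeric prefix of each ticket; purely numeric tickets become "NUMERIC"
prefix <- str_trim(str_replace_all(train$Ticket, "[0-9./]", ""))
prefix[prefix == ""] <- "NUMERIC"
table(train$Survived, prefix)
```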
### Fare
Pclass established that the socioeconomic class of the passengers may have contributed to their survival, and fare seems to be another good indicator of that, so I suspect it holds useful information.
There are a lot of distinct fares, but I think they can be split into a few categories. Let's look at which fares lead to poor survival and lump those together.
```{r Fare Examination}
# round fares to the nearest 100 and compute the death rate for each fare group
temp <- table(train$Survived, round(train$Fare, -2))
temp2 <- temp[1, ] / colSums(temp)
# fare groups where fewer than half the passengers died
index <- which(temp2 < .5)
names(index) <- NULL
Surviving_fares <- as.numeric(colnames(temp)[index])
Dying_fares <- as.numeric(colnames(temp)[-index])
# flag each passenger as having a "good" (1) or "bad" (0) fare
train$whichfare <- NA
index <- which(round(train$Fare, -2) %in% Surviving_fares)
train$whichfare[index] <- 1
index <- which(round(train$Fare, -2) %in% Dying_fares)
train$whichfare[index] <- 0
```
```{r Fare Barplot, echo=FALSE}
barplot(table(train$Survived, train$whichfare), col = rainbow(10), names.arg = c("Bad Fares", "Good Fares"), las = 2)
legend("topright", fill=rainbow(10), legend=c("Died", "Survived"))
```
If a given fare group had less than half survival, it was labeled a "bad fare"; otherwise it was a "good fare." I binned by rounding to the nearest 100 so that no group would have too small a sample size.
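As a sanity check, the survival rate within each group should sit on the expected side of the 50% cutoff by construction:
```{r whichfare check}
# survival rate within the bad (0) and good (1) fare groups
prop.table(table(train$Survived, train$whichfare), margin = 2)
```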
### Cabin
Because Cabin is so sparse, I am opting to drop it from our data set. Over 77% of the entries are NA, and since I do not know the connection between cabin and survival, I am not comfortable extrapolating from such a sparse set.
```{r cabin}
length(which(is.na(train$Cabin) == T))/nrow(train)
```
### Embarked
Embarked describes which port the passenger boarded from. This may have an impact, as different ports might pick up different kinds of passengers on average.
```{r Embarked, echo = FALSE}
barplot(table(train$Survived, train$Embarked), col = rainbow(5))
legend("topleft", fill=rainbow(5), legend=c("Died", "Survived"))
```
It seems that the port of embarkation matters, but there is a drastic difference in the sample sizes. This might become an issue, as the groups with fewer people are not fully represented. I will keep it for now and attempt a variable reduction in the modeling stage to see if Embarked should stay.
It is important to note that there are a few missing values in Embarked. Since there are not many, I will simply fill them with a random sample of the existing values.
```{r Embarked Imputation}
index <- which(is.na(train$Embarked))
set.seed(6)
# fill the missing entries with a random sample of the observed values
train$Embarked[index] <- sample(train$Embarked[-index], length(index))
```
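A one-line check confirms that nothing is missing anymore:
```{r embarked check}
# should be zero after the imputation above
sum(is.na(train$Embarked))
```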
# 2. Models
Now that the data is cleaned and candidates have been chosen, it is time to decide on predictors and models. Some, like 'Sex', are guaranteed choices based on theory and consensus. Others, like 'Age', will be tested and kept only if the models that include them perform well.
We will experiment with the following models:
* SVM
* KNN
* Random Forest
* Logistic Regression
* Linear Regression
Each will initially be tested with the following variables:
* Pclass
* Title
* Sex
* agebin
+ The new binned ages that I made
+ This will cause collinearity issues if I use it in conjunction with Title, since the age imputation was done using the passenger titles. As such, I will include one or the other (see the quick check after this list).
* SibSp
* Parch
* whichfare
+ Whether they had a good fare or bad fare
* Embarked
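To illustrate that collinearity concern, here is a quick cross-tabulation (added as a sanity check) showing how tightly the imputed age bins track the titles:
```{r title agebin check}
# each title concentrates in a few age bins, since missing ages were imputed from titles
table(train$title, train$agebin)
```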
I will make a new data set called 'dat' with these variables, turning all our factors into numbers. In this step I will also scale the columns so that they are centered around 0.
```{r new data set, include=TRUE}
# create dat (cbind coerces everything to character; factors are rebuilt below)
Survived <- train$Survived
dat <- cbind(train$title, train$agebin, train$Pclass, train$Sex, train$SibSp, train$Parch, train$whichfare, train$Embarked)
colnames(dat) <- c("Title", "Agebin", "Pclass", "Sex", "SibSp", "Parch", "Whichfare", "Embarked")
dat <- as.data.frame(dat)
dat$Title <- factor(dat$Title, labels = 1:6)
dat$Sex <- factor(dat$Sex, labels = 0:1)
dat$Embarked <- factor(dat$Embarked, labels = 1:3)
#create scaled dat
dat.scaled <- dat
for(i in 1:ncol(dat))
{
dat.scaled[,i] = as.numeric(dat[,i])
}
dat.scaled <- scale(dat.scaled)
# re-attach Survived
dat <- cbind(Survived, dat)
dat.scaled <- cbind(Survived, dat.scaled)
dat.scaled <- as.data.frame(dat.scaled)
```
### SVM
I'm choosing to start with SVM because it is relatively easy to implement and generally reliable. There are ways to improve its performance, such as PCA/MCA, that we will not pursue in this elementary example.
Since we do not have extraordinarily large data, we will use the e1071 package (for large data sets, I recommend the liquidSVM package). We will also need to split our data into train and test sets so that we can see how our models do before we submit to Kaggle. To do so, I wrote a little function called Cross_val_maker to split the data and a function called Percent to return the accuracy.
```{r my functions}
# split data into Train and Test sets; alpha is the fraction held out as Test
Cross_val_maker <- function(data, alpha)
{
  if(alpha > 1 || alpha <= 0)
  {
    return("Alpha must be between 0 and 1")
  }
  index <- sample(c(1:nrow(data)), round(nrow(data)*alpha))
  train <- data[-index,]
  test <- data[index,]
  return(list("Train" = as.data.frame(train), "Test" = as.data.frame(test)))
}
# accuracy: correct predictions (diagonal of the confusion table) over the total
Percent <- function(true, test)
{
  return(sum(diag(table(true, test)))/sum(table(true, test)))
}
```
```{r SVM fit}
# Scaled data
set.seed(7)
Cross_val_holder <- Cross_val_maker(dat.scaled, .1)
Train.scaled <- Cross_val_holder$Train
Test.scaled <- Cross_val_holder$Test
svmfit <- svm(Survived ~ ., data = Train.scaled, type = "C-classification")
results <- predict(svmfit, Test.scaled[,-1])
Percent(Test.scaled[,1], results)
# 80.89888%

# Unscaled data
set.seed(7)
Cross_val_holder <- Cross_val_maker(dat, .1)
Train.raw <- Cross_val_holder$Train
Test.raw <- Cross_val_holder$Test
svmfit <- svm(Survived ~ ., data = Train.raw, type = "C-classification")
results <- predict(svmfit, Test.raw[,-1])
Percent(Test.raw[,1], results)
```
Our first run comes out to about 81% for the scaled data and about 76% for the unscaled data. This is not bad, but let's do some repeated sampling to get a more reliable estimate of the accuracy.
```{r SVM repeated sampling}
# PICK UP HERE NEXT TIME
```
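As a starting point for that, a minimal sketch of the repeated sampling (my suggestion, reusing Cross_val_maker and Percent from above) might look like this:
```{r svm repeated sampling sketch}
# repeat the split/fit/score cycle and look at the distribution of accuracies
set.seed(7)
accs <- replicate(100, {
  holder <- Cross_val_maker(dat.scaled, .1)
  fit <- svm(Survived ~ ., data = holder$Train, type = "C-classification")
  Percent(holder$Test[, 1], predict(fit, holder$Test[, -1]))
})
mean(accs)
hist(accs, main = "SVM accuracy over 100 random splits", xlab = "Accuracy")
```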
# Results
* how it did
* how it was created
* why it didn't perform as well
* interesting things
+ consistently even results
+ in testing it did okay
+ ways it can be improved
```{r references}
# https://www.tutorialspoint.com/r/r_mean_median_mode.htm
# https://stackoverflow.com/questions/9981929/how-to-display-all-x-labels-in-r-barplot
# https://www.kaggle.com/tysonni/extracting-passenger-titiles-in-r
# https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner
# https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ksmooth.html
```