---
title: "Titanic Survival"
author: "Travis Barton"
date: "7/24/2018"
output: html_document
---
```{r setup and Libraries, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(readr)    # read_csv
library(stringr)  # title extraction from passenger names
library(e1071)    # svm
library(caret)
train <- read_csv("~/Desktop/Projects/Titanic/train.csv")
test <- read_csv("~/Desktop/Projects/Titanic/test.csv")
```
This guide should serve as a first-time data analyst's stepping stone into the basic principles of statistics. Our goal is to understand the data first, and only then analyze and predict with it.
# 1. Data Cleaning
### Pclass
This represents the passenger class of the ship, which should correlate directly with socioeconomic status. Let's take a look at how the classes fared in terms of survival (pun intended).
```{r pclass graph, echo=FALSE}
barplot(table(train$Survived, train$Pclass), col = c("dark blue", "orange"))
legend("topleft", fill=c("dark blue", "orange"), legend=c("Died", "Survived"))
```
So it seems that a better passenger class corresponds with better survival. This means we should keep the variable for our final model.
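To put a number on that impression, here is a quick check (one I'm adding; it is not part of the original analysis) of the survival rate within each class, using column-wise proportions of the same table:
```{r pclass proportions}
# survival rate within each passenger class
prop.table(table(train$Survived, train$Pclass), margin = 2)
```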
### Name
'Name' will likely be the least helpful variable of the bunch, but that's not to say there is nothing to be learned from it. The names themselves are not all that informative, but the titles that go along with them will be. Their usefulness will show up mostly in the 'Age' stage, but for now, let's extract the titles.
```{r extract the titles, echo=FALSE}
# pull out the title: the text between the comma and the first period in each name
train$title <- str_sub(train$Name, str_locate(train$Name, ",")[ , 1] + 2, str_locate(train$Name, "\\.")[ , 1] - 1)
barplot(table(train$Survived, train$title), col = rainbow(3), las = 2)
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
While many of the bars are hard to see, Mr. seems to correlate with low survival rates, while Mrs. and Miss seem to mean higher survival rates. This will be useful to us. In fact, the remaining titles are so rare that we might as well combine them into their own category.
```{r combining rare titles, echo=FALSE}
index <- which(train$title %in% c("Master", "Miss", "Mr", "Mrs", "Rev"))
train$title[-index] = "Other"
barplot(table(train$Survived, train$title), col = rainbow(3), las = 2)
legend("topright", fill=rainbow(3), legend=c("Died", "Survived"))
```
### Sex
Let's see how sex relates to survival, and check how many values are missing.
```{r sex-examination, echo=FALSE}
barplot(table(train$Survived, train$Sex), col = rainbow(3))
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
It looks like being female on the Titanic increased your likelihood of survival, which means Sex will be a useful predictor. What is more, Sex has no missing values, so no imputation is necessary.
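Since missing values will come up repeatedly below, here is a quick overview of the NA counts in the training set (a check I find worth running up front; the Age and Cabin figures quoted later come from exactly this kind of tally):
```{r missing overview}
# NA count and fraction for every column of the training set
colSums(is.na(train))
round(colMeans(is.na(train)), 4)
```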
### Age
Age will be a troublesome variable: 19.86% of the ages are missing. With so many missing values, it would not be practical to give them all the same value (whether by mean or median imputation); doing so would throw away much of the information that can be gleaned.
This is where the title variable will come in. We can get a more accurate estimate of the ages of the passengers if we impute the age based on their title.
```{r age imputation}
# median age for each title, computed from the non-missing ages
title_medians <- tapply(train$Age, train$title, median, na.rm = TRUE)
# fill each missing age with the median age for that passenger's title
missing <- is.na(train$Age)
train$Age[missing] <- title_medians[train$title[missing]]
```
Now that we have a more accurate estimate of the ages, let's take a look at how they relate to survival.
```{r, echo=FALSE}
index <- which(train$Survived == 1)
boxplot(train$Age[index], train$Age[-index], names = c("Survived", "Died") )
```
It seems that the people who survived had a slightly lower age than those who died, but the picture is not clear. Let's try binning the ages and see if that clarifies what is happening.
```{r binning ages, include=FALSE}
# bin the ages: child (< 12), teenager (12-19), young adult (19-26),
# adult (26-59), senior (59 and up)
breaks <- c(-Inf, 12, 19, 26, 59, Inf)
train$agebin <- as.integer(cut(train$Age, breaks, right = FALSE))
```
```{r, echo=FALSE}
barplot(table(train$Survived, train$agebin), col = rainbow(3), names.arg = c("< 12", "12-19", "19-26", "26-59", "> 59"), las = 1)
legend("topleft", fill=rainbow(3), legend=c("Died", "Survived"))
```
This is still not looking very useful, however. Age may have to be dropped.
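To put numbers on that, here is the survival rate within each age bin (the same column-proportion trick as before):
```{r agebin proportions}
# survival rate within each age bin
prop.table(table(train$Survived, train$agebin), margin = 2)
```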
### SibSp
This variable represents the number of siblings and spouses a person had on board. My intuition tells me that larger families will be able to get off of the boat more easily, but let's see what the data has to say.
```{r sibsp examination}
barplot(table(train$Survived, train$SibSp), col = rainbow(5))
legend("topright", fill=rainbow(5), legend=c("Died", "Survived"))
```
My intuition seems only partly right: passengers with one or two siblings or spouses aboard fared better than lone passengers, while the largest families fared worst. Either way, this variable will give our model valuable insight.
### Parch
Parch measures the number of parents or children that were aboard with each passenger. Since SibSp surprised me by suggesting that small family groups fared best, I suspect that Parch will agree, but let's take a look at the numbers first.
```{r Parch examination}
barplot(table(train$Survived, train$Parch), col = rainbow(10))
legend("topright", fill=rainbow(10), legend=c("Survived", "Died"))
```
Again, passengers with one or two family members aboard did best, while those traveling alone or in very large groups fared worse. Whatever the reason, it should make a good predictor for our model.
### Ticket
Ticket does not seem to contain any usable information. The patterns may hide something useful, but nothing jumps out to my eye.
```{r Ticket Examination}
head(train$Ticket)
```
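If you wanted to dig further, one way to probe the tickets (just a sketch, not something pursued here) is to strip the digits and tabulate the remaining letter prefixes against survival:
```{r ticket prefixes}
# keep only the non-numeric prefix of each ticket; purely numeric tickets become "NUMERIC"
prefix <- str_trim(str_replace_all(train$Ticket, "[0-9./]", ""))
prefix[prefix == ""] <- "NUMERIC"
table(train$Survived, prefix)
```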
### Fare
Pclass established that the socioeconomic class of the passengers may have contributed to their survival, and fare seems to be another good indicator of that, so I suspect it holds useful information.
There are a lot of distinct fares, but I think they can be split into a few categories. Let's look at which fares lead to poor survival and lump those together.
```{r Fare Examination}
# round fares to the nearest 100 and compute the death rate for each fare group
temp <- table(train$Survived, round(train$Fare, -2))
temp2 <- temp[1, ] / colSums(temp)
# fare groups where fewer than half the passengers died
index <- which(temp2 < .5)
names(index) <- NULL
Surviving_fares <- as.numeric(colnames(temp)[index])
Dying_fares <- as.numeric(colnames(temp)[-index])
# flag each passenger as having a "good" (1) or "bad" (0) fare
train$whichfare <- NA
index <- which(round(train$Fare, -2) %in% Surviving_fares)
train$whichfare[index] <- 1
index <- which(round(train$Fare, -2) %in% Dying_fares)
train$whichfare[index] <- 0
```
```{r Fare Barplot, echo=FALSE}
barplot(table(train$Survived, train$whichfare), col = rainbow(10), names.arg = c("Bad Fares", "Good Fares"), las = 2)
legend("topright", fill=rainbow(10), legend=c("Died", "Survived"))
```
If a given fare group had less than half survival, it was labeled a "bad fare"; otherwise it was a "good fare." I binned by rounding to the nearest 100 so that no group would have too small a sample size.
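As a sanity check, the survival rate within each group should sit on the expected side of the 50% cutoff by construction:
```{r whichfare check}
# survival rate within the bad (0) and good (1) fare groups
prop.table(table(train$Survived, train$whichfare), margin = 2)
```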
### Cabin
Because Cabin is so sparse, I am opting to drop it from our data set. Over 77% of the entries are NA, and since I do not know the connection between cabin and survival, I am not comfortable extrapolating from such a sparse set.
```{r cabin}
length(which(is.na(train$Cabin) == T))/nrow(train)
```
### Embarked
Embarked describes which port the passenger boarded from. This may have an impact, as different ports might pick up different kinds of passengers on average.
```{r Embarked, echo = FALSE}
barplot(table(train$Survived, train$Embarked), col = rainbow(5))
legend("topleft", fill=rainbow(5), legend=c("Died", "Survived"))
```
It seems that the port of embarkation matters, but there is a drastic difference in the sample sizes. This might become an issue, as the groups with fewer people are not fully represented. I will keep it for now and attempt a variable reduction in the modeling stage to see if Embarked should stay.
It is important to note that there are a few missing values in Embarked. Since there are not many, I will simply fill them with a random sample of the existing values.
```{r Embarked Imputation}
index <- which(is.na(train$Embarked))
set.seed(6)
# fill the missing entries with a random sample of the observed values
train$Embarked[index] <- sample(train$Embarked[-index], length(index))
```
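A one-line check confirms that nothing is missing anymore:
```{r embarked check}
# should be zero after the imputation above
sum(is.na(train$Embarked))
```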
# 2. Models
Now that the data is cleaned and candidates have been chosen, it is time to decide on predictors and models. Some, like 'Sex', are guaranteed choices based on theory and consensus. Others, like 'Age', will be tested and kept only if the models that include them perform well.
We will experiment with the following models:
* SVM
* KNN
* Random Forest
* Logistic Regression
* Linear Regression
Each will initially be tested with the following variables:
* Pclass
* Title
* Sex
* agebin
+ The new binned ages that I made
+ This will cause collinearity issues if I use it in conjunction with Title, since the age imputation was done using the passenger titles. As such, I will include one or the other (see the quick check after this list).
* SibSp
* Parch
* whichfare
+ Whether they had a good fare or bad fare
* Embarked
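To illustrate that collinearity concern, here is a quick cross-tabulation (added as a sanity check) showing how tightly the imputed age bins track the titles:
```{r title agebin check}
# each title concentrates in a few age bins, since missing ages were imputed from titles
table(train$title, train$agebin)
```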
I will make a new data set called 'dat' with these variables, turning all our factors into numbers. In this step I will also scale the columns so that they are centered around 0.
```{r new data set, include=TRUE}
# create dat (cbind coerces everything to character; factors are rebuilt below)
Survived <- train$Survived
dat <- cbind(train$title, train$agebin, train$Pclass, train$Sex, train$SibSp, train$Parch, train$whichfare, train$Embarked)
colnames(dat) <- c("Title", "Agebin", "Pclass", "Sex", "SibSp", "Parch", "Whichfare", "Embarked")
dat <- as.data.frame(dat)
dat$Title <- factor(dat$Title, labels = 1:6)
dat$Sex <- factor(dat$Sex, labels = 0:1)
dat$Embarked <- factor(dat$Embarked, labels = 1:3)
#create scaled dat
dat.scaled <- dat
for(i in 1:ncol(dat))
{
dat.scaled[,i] = as.numeric(dat[,i])
}
dat.scaled <- scale(dat.scaled)
# re-attach Survived
dat <- cbind(Survived, dat)
dat.scaled <- cbind(Survived, dat.scaled)
dat.scaled <- as.data.frame(dat.scaled)
```
### SVM
I'm choosing to start with SVM because it is relatively easy to implement and generally reliable. There are ways to improve its performance, such as PCA/MCA, that we will not pursue in this elementary example.
Since we do not have extraordinarily large data, we will use the e1071 package (for large data sets, I recommend the liquidSVM package). We will also need to split our data into train and test sets so that we can see how our models do before we submit to Kaggle. To do so, I wrote a little function called Cross_val_maker to split the data and a function called Percent to return the accuracy.
```{r my functions}
# split data into Train and Test sets; alpha is the fraction held out as Test
Cross_val_maker <- function(data, alpha)
{
  if(alpha > 1 || alpha <= 0)
  {
    return("Alpha must be between 0 and 1")
  }
  index <- sample(c(1:nrow(data)), round(nrow(data)*alpha))
  train <- data[-index,]
  test <- data[index,]
  return(list("Train" = as.data.frame(train), "Test" = as.data.frame(test)))
}
# accuracy: correct predictions (diagonal of the confusion table) over the total
Percent <- function(true, test)
{
  return(sum(diag(table(true, test)))/sum(table(true, test)))
}
```
```{r SVM fit}
# Scaled data
set.seed(7)
Cross_val_holder <- Cross_val_maker(dat.scaled, .1)
Train.scaled <- Cross_val_holder$Train
Test.scaled <- Cross_val_holder$Test
svmfit <- svm(Survived ~ ., data = Train.scaled, type = "C-classification")
results <- predict(svmfit, Test.scaled[,-1])
Percent(Test.scaled[,1], results)
# 80.89888%

# Unscaled data
set.seed(7)
Cross_val_holder <- Cross_val_maker(dat, .1)
Train.raw <- Cross_val_holder$Train
Test.raw <- Cross_val_holder$Test
svmfit <- svm(Survived ~ ., data = Train.raw, type = "C-classification")
results <- predict(svmfit, Test.raw[,-1])
Percent(Test.raw[,1], results)
```
Our first run comes out to about 81% for the scaled data and about 76% for the unscaled data. This is not bad, but let's do some repeated sampling to get a more reliable estimate of the accuracy.
```{r SVM repeated sampling}
# PICK UP HERE NEXT TIME
```
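As a starting point for that, a minimal sketch of the repeated sampling (my suggestion, reusing Cross_val_maker and Percent from above) might look like this:
```{r svm repeated sampling sketch}
# repeat the split/fit/score cycle and look at the distribution of accuracies
set.seed(7)
accs <- replicate(100, {
  holder <- Cross_val_maker(dat.scaled, .1)
  fit <- svm(Survived ~ ., data = holder$Train, type = "C-classification")
  Percent(holder$Test[, 1], predict(fit, holder$Test[, -1]))
})
mean(accs)
hist(accs, main = "SVM accuracy over 100 random splits", xlab = "Accuracy")
```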
# Results
* how it did
* how it was created
* why it didn't perform as well
* interesting things
+ consistently even results
+ in testing it did okay
+ ways it can be improved
```{r references}
# https://www.tutorialspoint.com/r/r_mean_median_mode.htm
# https://stackoverflow.com/questions/9981929/how-to-display-all-x-labels-in-r-barplot
# https://www.kaggle.com/tysonni/extracting-passenger-titiles-in-r
# https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner
# https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ksmooth.html
```