week_2_lab.Rmd

---
title: "EDA and data visualization"
author: "Monica Alexander"
date: "January 18 2022"
output: 
    pdf_document:
      number_sections: true
      toc: true
      latex_engine: xelatex
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# Overview

This week we will be going through some exploratory data analysis (EDA) and data visualization steps in R. The aim is to get you used to working with real data (that has issues) to understand the main characteristics and potential issues. 

We will be using the [`opendatatoronto`](https://sharlagelfand.github.io/opendatatoronto/) R package, which interfaces with the City of Toronto Open Data Portal. 

A good resource is part 1 (especially chapters 3 and 7) of 'R for Data Science' by Hadley Wickham, available for free here: https://r4ds.had.co.nz/. 

## What to hand in via GitHub

**There are exercises at the end of this lab**. Please make a new .Rmd file with your answers, call it something sensible (e.g. `week_2_lab.Rmd`), commit to your git repo from last week, and push to GitHub. Due on Friday by 9am. 

## A further note on RStudio projects

I mentioned projects last week, but I'd just like to remind everyone that they're a good thing to use.  The main advantage of projects is that you don't have to set the working directory or type the whole file path to read in a file (for example, a data file). So instead of reading a csv from `"~/Documents/applied-stats/data/"` you can just read it in from `data/`. 

**This is super useful for your assignments**. For the assignments:

- You are expected to hand in the Rmd files, and I should be able to compile these with no errors
- I cannot read in a file that has a path from your local machine. 

## A note on packages

If you are running this Rmd on your local machine, you may need to install various packages used (using the `install.packages` function). 
Load in all the packages we need:

```{r}
library(opendatatoronto)
library(tidyverse)
library(stringr)
library(skimr) # EDA
library(visdat) # EDA
library(janitor)
library(lubridate)
library(ggrepel)
```



# Lab Exercises

To be handed in via submission of Rmd file to GitHub.

1. Using the `opendatatoronto` package, download the data on mayoral campaign contributions for 2014. Hints:
    + find the ID code you need for the package you need by searching for 'campaign' in the `all_data` tibble above
    + you will then need to `list_package_resources` to get ID for the data file
    + note: the 2014 file you will get from `get_resource` has a bunch of different campaign contributions, so just keep the data that relates to the Mayor election
    
    
```{r}
library(opendatatoronto)
l <- list_package_resources('f6651a40-2f52-46fc-9e04-b760c16edd5c')

a <- get_resource(l$id[1])
a <- a$`2_Mayor_Contributions_2014_election.xls`
```
    
2. Clean up the data format (fixing the parsing issue and standardizing the column names using `janitor`)

```{r}
df <- janitor::clean_names(a)
df <- clean_names(row_to_names(df,1))
```

3. Summarize the variables in the dataset. Are there missing values, and if so, should we be worried about them? Is every variable in the format it should be? If not, create new variable(s) that are in the right format.


There are missing values in many columns. We may need to worry about potential missing values in 'relationship_to_candidate' (we don't know if it's missing or it means there's no relationship), since we will need to do exercise 6 based on its values. We don't need to worry about other missing values since they are not our main interest.

```{r}
skim(df)
```

```{r}
df$contribution_amount1 <- as.integer(df$contribution_amount)
```


4. Visually explore the distribution of values of the contributions. What contributions are notable outliers? Do they share a similar characteristic(s)? It may be useful to plot the distribution of contributions without these outliers to get a better sense of the majority of the data. 

```{r}
p <- ggplot(df, aes(x=contribution_amount1)) + 
  geom_boxplot()+
  scale_x_log10()
p
# the outliers starts at about 6000~7000
filter(df,contribution_amount1>6000)

#
```

We see that the 'relationship_to_candidate' for all outliers are 'Candidate'. 

```{r}
p <- ggplot(filter(df,contribution_amount1<6000), aes(x=contribution_amount1)) + 
  geom_boxplot()
p

```

5. List the top five candidates in each of these categories:
    + total contributions
    + mean contribution
    + number of contributions

```{r}
df %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(total=sum(contribution_amount1)) %>% 
  top_n(5)

df %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(mean=mean(contribution_amount1)) %>% 
  top_n(5)

df %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(number=n()) %>% 
  top_n(5)
```


6. Repeat 5 but without contributions from the candidates themselves.

```{r}
df1 <- filter(df,is.na(relationship_to_candidate)|relationship_to_candidate!='Candidate')

df1 %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(total=sum(contribution_amount1)) %>% 
  top_n(5)

df1 %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(mean=mean(contribution_amount1)) %>% 
  top_n(5)

df1 %>%
  select(c(contribution_amount1,candidate)) %>%
  group_by(candidate) %>% 
  summarize(number=n()) %>% 
  top_n(5)
```

7. How many contributors gave money to more than one candidate? 

```{r}
nrow(df %>% 
  group_by(contributors_name,contributors_postal_code) %>% 
  summarize(n=length(unique(candidate))) %>% 
  filter(n>1))
```