-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathweek_2_lab.Rmd
176 lines (125 loc) · 5.79 KB
/
week_2_lab.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: "EDA and data visualization"
author: "Monica Alexander"
date: "January 18 2022"
output:
pdf_document:
number_sections: true
toc: true
latex_engine: xelatex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Overview
This week we will be going through some exploratory data analysis (EDA) and data visualization steps in R. The aim is to get you used to working with real data (that has issues) to understand the main characteristics and potential issues.
We will be using the [`opendatatoronto`](https://sharlagelfand.github.io/opendatatoronto/) R package, which interfaces with the City of Toronto Open Data Portal.
A good resource is part 1 (especially chapters 3 and 7) of 'R for Data Science' by Hadley Wickham, available for free here: https://r4ds.had.co.nz/.
## What to hand in via GitHub
**There are exercises at the end of this lab**. Please make a new .Rmd file with your answers, call it something sensible (e.g. `week_2_lab.Rmd`), commit to your git repo from last week, and push to GitHub. Due on Friday by 9am.
## A further note on RStudio projects
I mentioned projects last week, but I'd just like to remind everyone that they're a good thing to use. The main advantage of projects is that you don't have to set the working directory or type the whole file path to read in a file (for example, a data file). So instead of reading a csv from `"~/Documents/applied-stats/data/"` you can just read it in from `data/`.
**This is super useful for your assignments**. For the assignments:
- You are expected to hand in the Rmd files, and I should be able to compile these with no errors
- I cannot read in a file that has a path from your local machine.
## A note on packages
If you are running this Rmd on your local machine, you may need to install various packages used (using the `install.packages` function).
Load in all the packages we need:
```{r}
library(opendatatoronto)
library(tidyverse)
library(stringr)
library(skimr) # EDA
library(visdat) # EDA
library(janitor)
library(lubridate)
library(ggrepel)
```
# Lab Exercises
To be handed in via submission of Rmd file to GitHub.
1. Using the `opendatatoronto` package, download the data on mayoral campaign contributions for 2014. Hints:
+ find the ID code you need for the package you need by searching for 'campaign' in the `all_data` tibble above
+ you will then need to `list_package_resources` to get ID for the data file
+ note: the 2014 file you will get from `get_resource` has a bunch of different campaign contributions, so just keep the data that relates to the Mayor election
```{r}
library(opendatatoronto)
l <- list_package_resources('f6651a40-2f52-46fc-9e04-b760c16edd5c')
a <- get_resource(l$id[1])
a <- a$`2_Mayor_Contributions_2014_election.xls`
```
2. Clean up the data format (fixing the parsing issue and standardizing the column names using `janitor`)
```{r}
df <- janitor::clean_names(a)
df <- clean_names(row_to_names(df,1))
```
3. Summarize the variables in the dataset. Are there missing values, and if so, should we be worried about them? Is every variable in the format it should be? If not, create new variable(s) that are in the right format.
There are missing values in many columns. We may need to worry about potential missing values in 'relationship_to_candidate' (we don't know if it's missing or it means there's no relationship), since we will need to do exercise 6 based on its values. We don't need to worry about other missing values since they are not our main interest.
```{r}
skim(df)
```
```{r}
df$contribution_amount1 <- as.integer(df$contribution_amount)
```
4. Visually explore the distribution of values of the contributions. What contributions are notable outliers? Do they share a similar characteristic(s)? It may be useful to plot the distribution of contributions without these outliers to get a better sense of the majority of the data.
```{r}
p <- ggplot(df, aes(x=contribution_amount1)) +
geom_boxplot()+
scale_x_log10()
p
# the outliers starts at about 6000~7000
filter(df,contribution_amount1>6000)
#
```
We see that the 'relationship_to_candidate' for all outliers are 'Candidate'.
```{r}
p <- ggplot(filter(df,contribution_amount1<6000), aes(x=contribution_amount1)) +
geom_boxplot()
p
```
5. List the top five candidates in each of these categories:
+ total contributions
+ mean contribution
+ number of contributions
```{r}
df %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(total=sum(contribution_amount1)) %>%
top_n(5)
df %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(mean=mean(contribution_amount1)) %>%
top_n(5)
df %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(number=n()) %>%
top_n(5)
```
6. Repeat 5 but without contributions from the candidates themselves.
```{r}
df1 <- filter(df,is.na(relationship_to_candidate)|relationship_to_candidate!='Candidate')
df1 %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(total=sum(contribution_amount1)) %>%
top_n(5)
df1 %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(mean=mean(contribution_amount1)) %>%
top_n(5)
df1 %>%
select(c(contribution_amount1,candidate)) %>%
group_by(candidate) %>%
summarize(number=n()) %>%
top_n(5)
```
7. How many contributors gave money to more than one candidate?
```{r}
nrow(df %>%
group_by(contributors_name,contributors_postal_code) %>%
summarize(n=length(unique(candidate))) %>%
filter(n>1))
```