-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathfiltering.qmd
151 lines (105 loc) · 6.99 KB
/
filtering.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# Filters and selections
More often than not, we have more data than we want. Sometimes we need to be rid of that data. In `dplyr`, there's two ways to go about this: filtering and selecting.
**Filtering creates a subset of the data based on criteria**. All records where the count is greater than 10. All records that match "Nebraska". Something like that.
**Selecting simply returns only the fields named**. So if you only want to see School and Attendance, you select those fields. When you look at your data again, you'll have two columns. If you try to use one of your columns that you had before you used select, you'll get an error.
Let's work with our football attendance data to show some examples.
```{r}
#| warning: false
#| message: false
#| results: asis
#| echo: false
library(downloadthis)
library(glue)
dllink <- download_link(
link = "http://mattwaite.github.io/sportsdatafiles/attendance.csv",
button_label = "Download csv file",
button_type = "danger",
has_icon = TRUE,
icon = "fa fa-save",
self_contained = FALSE
)
glue("<pre><p><strong>For this walkthrough:</strong></p><p>{dllink}</p></pre>")
```
First we'll need the tidyverse.
```{r}
#| warning: false
#| message: false
library(tidyverse)
```
Now import the data.
```{r}
attendance <- read_csv('data/attendance.csv')
```
So, first things first, let's say we don't care about all this Air Force, Akron, Alabama crap and just want to see Dear Old Nebraska U. We do that with `filter` and then we pass it a condition.
Before we do that, a note about conditions. Most of the conditional operators you'll understand -- greater than and less than are > and <. The tough one to remember is equal to. In conditional statements, equal to is == not =. If you haven't noticed, = is a variable assignment operator, not a conditional statement. So equal is == and NOT equal is !=.
So if you want to see Institutions equal to Nebraska, you do this:
```{r}
attendance |> filter(Institution == "Nebraska")
```
Or if we want to see schools that had more than half a million people buy tickets to a football game in a season, we do the following. NOTE THE BACKTICKS.
```{r}
attendance |> filter(`2018` >= 500000)
```
But what if we want to see all of the Power Five conferences? We *could* use conditional logic in our filter. The conditional logic operators are `|` for OR and `&` for AND. NOTE: AND means all conditions have to be met. OR means any of the conditions work. So be careful about boolean logic.
```{r}
attendance |> filter(Conference == "Big 10" | Conference == "SEC" | Conference == "Pac-12" | Conference == "ACC" | Conference == "Big 12")
```
But that's a lot of repetitive code. And a lot of typing. And typing is the devil. So what if we could create a list and pass it into the filter? It's pretty simple.
We can create a new variable -- remember variables can represent just about anything -- and create a list. To do that we use the `c` operator, which stands for concatenate. That just means take all the stuff in the parenthesis after the c and bunch it into a list.
Note here: text is in quotes. If they were numbers, we wouldn't need the quotes.
```{r}
powerfive <- c("SEC", "Big Ten", "Pac-12", "Big 12", "ACC")
```
Now with a list, we can use the %in% operator. It does what you think it does -- it gives you data that matches things IN the list you give it.
```{r}
attendance |> filter(Conference %in% powerfive)
```
## Selecting data to make it easier to read
So now we have our Power Five list. What if we just wanted to see attendance from the most recent season and ignore all the rest? Select to the rescue.
```{r}
attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2018`)
```
If you have truly massive data, Select has tools to help you select fields that start_with the same things or ends with a certain word. [The documentation will guide you](https://dplyr.tidyverse.org/reference/select.html) if you need those someday. For 90 plus percent of what we do, just naming the fields will be sufficient.
## Using conditional filters to set limits
Let's return to the problem of one-hit wonders in basketball mucking up our true shooting analysis. How can we set limits in something like a question of who had the best season? Let's grab every player from last season.
```{r}
#| warning: false
#| message: false
#| results: asis
#| echo: false
library(downloadthis)
library(glue)
dllink <- download_link(
link = "http://mattwaite.github.io/sportsdatafiles/players21.csv",
button_label = "Download csv file",
button_type = "danger",
has_icon = TRUE,
icon = "fa fa-save",
self_contained = FALSE
)
glue("<pre><p><strong>For this walkthrough:</strong></p><p>{dllink}</p></pre>")
```
Let's get set up similar to the previous chapter.
```{r}
players <- read_csv("data/players21.csv")
players |>
mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
arrange(desc(trueshooting))
```
And ine the COVID season, Weber State's Jake Furgerson is your One Shot One Three True Shooting champion.
In most contests, like the batting title in Major League Baseball, there's a minimum number of X to qualify. In baseball, it's at bats. In basketball, it attempts. So let's set a floor and see how it changes. What if we said you had to have played 100 minutes in a season? The top players in college basketball play more than 1000 minutes in a season. So 100 is not that much. Let's try it and see.
```{r}
players |>
mutate(trueshooting = (PTS/(2*(FGA + (.44*FTA))))*100) |>
arrange(desc(trueshooting)) |>
filter(MP > 100)
```
Now you get South Carolina's Nathan Nelson, who played in 15 games and was on the floor for 113 minutes. So he played regularly, but not a lot. But in that time, he only attempted 14 shots, and made 64 percent of them. In other words, when he shot, he probably scored. He just rarely shot.
So is 100 minutes our level? Here's the truth -- there's not really an answer here. We're picking a cutoff. If you can cite a reason for it and defend it, then it probably works.
## Top list
One last little dplyr trick that's nice to have in the toolbox is a shortcut for selecting only the top values for your dataset. Want to make a Top 10 List? Or Top 25? Or Top Whatever You Want? It's easy.
So what are the top 10 Power Five schools by season attendance. All we're doing here is chaining commands together with what we've already got. We're *filtering* by our list of Power Five conferences, we're *selecting* the three fields we need, now we're going to *arrange* it by total attendance and then we'll introduce the new function: `top_n`. The `top_n` function just takes a number. So we want a top 10 list? We do it like this:
```{r}
attendance |> filter(Conference %in% powerfive) |> select(Institution, Conference, `2018`) |> arrange(desc(`2018`)) |> top_n(10)
```
That's all there is to it. Just remember -- for it to work correctly, you need to sort your data BEFORE you run top_n. Otherwise, you're just getting the first 10 values in the list. The function doesn't know what field you want the top values of. You have to do it.