Winter25 data summarization #656

Merged
merged 5 commits into from
Jan 9, 2025
156 changes: 82 additions & 74 deletions modules/Data_Summarization/Data_Summarization.Rmd
@@ -8,11 +8,8 @@ output:


```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
knitr::opts_chunk$set(comment = "", message = FALSE)
suppressWarnings(library(tidyverse))
```

<style type="text/css">
@@ -38,7 +35,7 @@ pre { /* Code block - slightly smaller in this lecture */

https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf

```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, align = "center"}
```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("images/Manip_cheatsheet.png")
```

@@ -99,7 +96,9 @@ sum(z)

## Some examples

We can use the `mtcars` built-in dataset. The `head` command displays the first rows of an object:
We can use the `mtcars` built-in dataset. "The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models)."

The `head` command displays the first rows of an object:

```{r}
head(mtcars)
@@ -112,23 +111,10 @@ A nice and readable way to chain together multiple R functions.

Changes `f(x, y)` to `x %>% f(y)`.
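
As a small illustration of that rewrite, the sketch below computes the same value with nested calls and with the pipe. The vector `x` is made up for this example and is not part of the lecture data.

```r
library(dplyr)  # loads the %>% pipe

x <- c(2.5, 3.5, 4.75)  # hypothetical values

# Nested calls: read from the inside out
nested <- round(mean(x), 1)

# Piped calls: read left to right
piped <- x %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms call `mean()` first and `round()` second; only the reading order changes.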

```{r eval=FALSE}
# Going to work (nested calls are read from the inside out)
check_pockets(
  pack_lunch(
    get_dressed(me, pants = TRUE, shirt = TRUE, footwear = "sandals"),
    items = c("sandwich", "chips", "apple"), lunchbox = TRUE),
  wallet = TRUE, phone = TRUE, keys = TRUE)

# Going to work, the tidy way
me %>%
  get_dressed(pants = TRUE, shirt = TRUE, footwear = "sandals") %>%
  pack_lunch(items = c("sandwich", "chips", "apple"), lunchbox = TRUE) %>%
  check_pockets(wallet = TRUE, phone = TRUE, keys = TRUE)
```
```{r, out.width = "50%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("../../images/lol/morning_1.png")
```


## Statistical summarization the "tidy" way

```{r}
@@ -141,7 +127,7 @@ mtcars %>% pull(wt) %>% quantile(probs = 0.6)

## Behavior of `pull()` function

`pull()` converts a single data column into a vector. This allows you to run summary functions on these data. Once you have "pulled" the data column out, you don't have to name it again in any piped summary functions.
`pull()` converts a single data column into a <span style="color:blue">vector</span>. This allows you to run summary functions on these data. Once you have "pulled" the data column out, you don't have to name it again in any piped summary functions.

```{r}
cars_wt <- mtcars %>% pull(wt)
@@ -157,18 +143,29 @@ mtcars %>% pull(wt) %>% range(wt) # Incorrect
mtcars %>% pull(wt) %>% range() # Correct
```
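
To make the vector point concrete, here is a minimal sketch contrasting `pull()` with `select()` on the built-in `mtcars` data. The object names `wt_vec` and `wt_df` are chosen only for this illustration.

```r
library(dplyr)

wt_vec <- mtcars %>% pull(wt)    # a plain numeric vector
wt_df  <- mtcars %>% select(wt)  # still a data frame, with one column

is.numeric(wt_vec)
is.data.frame(wt_df)

# Summary functions such as mean() expect the vector form
mean(wt_vec)
```

Calling `mean()` directly on `wt_df` would return `NA` with a warning, because a data frame is not a numeric vector.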

## GUT CHECK

What kind of object do we need to run summary operators like `mean()` ?

A. A vector of numbers

B. A vector of characters

C. A dataset

# Summarization on tibbles (data frames)

## TB Incidence
## TB incidence

Let's read in a `tibble` of values from TB incidence.

"Tuberculosis incidence, all forms (per 100,000 population per year), for the period 1990-2007 across 208 countries/territories."

```{r}
tb <- read_csv("https://jhudatascience.org/intro_to_r/data/tb.csv")
```

## TB Incidence
## TB incidence

Check out the data:

@@ -177,7 +174,7 @@ head(tb)
```


## TB Incidence
## TB incidence

Check out the data:

@@ -193,7 +190,6 @@ Before we go further, let's rename the first column using the `rename()` functio
In this case, we have to use the backticks (\`) because there are spaces and funky characters in the name.

```{r}
library(dplyr)
tb <- tb %>%
  rename(country = `TB incidence, all forms (per 100 000 population per year)`)
```
@@ -220,8 +216,8 @@ You can also do more elaborate summaries across different groups of data using `
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
summarize({summary column name} = {operator(source column)},
{summary column name} = {operator(source column)})
summarize({summary column name} = {function(source column)},
{summary column name} = {function(source column)})
```
</div>

@@ -234,7 +230,7 @@ You can also do more elaborate summaries across different groups of data using `
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
summarize({summary column name} = {operator(source column)})
summarize({summary column name} = {function(source column)})
```
</div>
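
One way to fill in the general format above, sketched on the built-in `mtcars` data instead of the lecture's `tb` data. The summary column names `mean_wt` and `max_mpg` are arbitrary choices for this example.

```r
library(dplyr)

car_summary <- mtcars %>%
  summarize(mean_wt = mean(wt),   # {summary column name} = {function(source column)}
            max_mpg = max(mpg))

car_summary
```

The result is a one-row data frame with one column per summary requested.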

@@ -291,10 +287,11 @@ summary(tb)

## Summary & Lab Part 1

- summary stats (`mean()`) work with `pull()`
- `pull()` creates a *vector*
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
- summary stats (`mean()`) work with vectors or with `summarize()`
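
A minimal sketch of why the `na.rm = TRUE` reminder matters, using a made-up vector `x` (an assumption for illustration, not lecture data):

```r
library(dplyr)

x <- c(1, NA, 3)

mean(x)                # NA: one missing value makes the mean missing
mean(x, na.rm = TRUE)  # 2: the NA is dropped before averaging

# The same argument works inside summarize()
avg_tbl <- tibble(x = x) %>%
  summarize(avg = mean(x, na.rm = TRUE))
```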

🏠 [Class Website](https://jhudatascience.org/intro_to_r/)

@@ -306,6 +303,8 @@ summary(tb)
Here we will be using the Youth Tobacco Survey data:
http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv

* Check out the data at: https://catalog.data.gov/dataset/youth-tobacco-survey-yts-data

```{r}
yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
head(yts)
@@ -324,7 +323,7 @@ yts %>%

## How many `distinct()` values?

`n_distinct()` tells you the number of unique elements. _Must pull the column first!_
`n_distinct()` tells you the number of unique elements. It needs a vector so you _must pull the column first!_

```{r}
yts %>%
@@ -338,7 +337,7 @@ options(max.print = 1000)
```
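
The pull-then-count pattern for `n_distinct()` can be tried on a small built-in dataset. This sketch uses `mtcars` and its `cyl` column in place of the `yts` data:

```r
library(dplyr)

n_cyl <- mtcars %>%
  pull(cyl) %>%     # turn the column into a vector first
  n_distinct()      # then count its unique values

n_cyl  # 3 (the cars have 4, 6, or 8 cylinders)
```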


## `dplyr`: `count`
## Use `count()` to return row count per category.

Use `count` to return a frequency table of unique elements of a data.frame.

@@ -347,31 +346,33 @@ yts %>% count(LocationDesc)
```


## `dplyr`: `count`

Multiple columns listed further subdivides the count.
## Listing multiple columns further subdivides the `count()`

```{r, message = FALSE}
yts %>% count(LocationDesc, TopicDesc)
```

**Note:** `count()` includes NAs
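
To see the note about NAs in action, here is a sketch with a tiny made-up tibble. The name `toy` and its `group` column are hypothetical.

```r
library(dplyr)

toy <- tibble(group = c("a", "a", "b", NA))

toy_counts <- toy %>% count(group)
toy_counts
```

The missing value gets its own row in the output rather than being dropped, so the counts still add up to the number of rows in `toy`.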

## `dplyr`: `count`

Multiple columns listed further subdivides the count.
## GUT CHECK

```{r, message = FALSE}
yts %>% count(LocationDesc, TopicDesc)
```
The `count()` function can help us tally:

<br>
A. Sample size

**Note:** `count()` includes NAs
B. Rows per each category

C. How many categories

# Grouping

## Perform Operations By Groups: dplyr
## Goal

We want to find the average frequency that youth use tobacco products in the dataset.

_How do we do this?_

## Perform operations by groups: dplyr

`group_by` allows you to group the data set by variables/columns you specify:

@@ -381,7 +382,7 @@ yts
```


## Perform Operations By Groups: dplyr
## Perform operations by groups: dplyr

`group_by` allows you to group the data set by variables/columns you specify:

@@ -400,7 +401,7 @@ yts_grouped %>% summarize(avg_percent = mean(Data_Value, na.rm = TRUE))
```
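
The same group-then-summarize idea, sketched on the built-in `mtcars` data so it runs without downloading `yts`. The names `by_cyl` and `avg_mpg` are chosen only for this example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                              # one group per cylinder count
  summarize(avg_mpg = mean(mpg, na.rm = TRUE))   # one summary row per group

by_cyl
```

Instead of a single overall mean, the output has one row for each value of `cyl`.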


## Use the `pipe` to string these together!
## Do it in one step: use `%>%` to string these together!

Pipe `yts` into `group_by`, then pipe that into `summarize`:

@@ -474,20 +475,19 @@ yts %>%
`count()` and `n()` can give very similar information.

```{r}
mtcars %>% count(cyl)
mtcars %>% group_by(cyl) %>% summarize(n()) # n() typically used with summarize
yts %>% count(YEAR) %>% head(n = 3)
yts %>% group_by(YEAR) %>% summarize(n = n()) %>% head(n = 3) # n() typically used with summarize
```


# A few miscellaneous topics ..
# A few miscellaneous topics


## Base R functions you might see: `length` and `unique`

These functions require a column as a vector using `pull()`.

```{r, message = FALSE}
yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
yts_loc <- yts %>% pull(LocationDesc) # pull() to make a vector
yts_loc %>% unique() # similar to distinct()
```
@@ -500,38 +500,26 @@ These functions require a column as a vector using `pull()`.
yts_loc %>% unique() %>% length() # similar to n_distinct()
```

## * New! * Many dplyr functions now have a `.by=` argument

Pipe `yts` into `group_by`, then pipe that into `summarize`:

```{r eval = FALSE}
yts %>%
  group_by(Response) %>%
  summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
            max_percent = max(Data_Value, na.rm = TRUE))
```

is the same as..

```{r eval = FALSE}
yts %>%
  summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
            max_percent = max(Data_Value, na.rm = TRUE),
            .by = Response)
```


## `summary()` vs. `summarize()`

* `summary()` (base R) gives statistics table on a dataset.
* `summarize()` (dplyr) creates a more customized summary tibble/dataframe.
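
A short side-by-side sketch of the two functions on the built-in `mtcars` data. The summary column names are arbitrary choices for this illustration.

```r
library(dplyr)

# summary(): a fixed set of statistics (min, quartiles, median, mean, max)
summary(mtcars$mpg)

# summarize(): you choose which statistics to compute and what to call them
custom <- mtcars %>%
  summarize(median_mpg = median(mpg),
            n_cars = n())
custom
```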

## Functions you might also see

* `rowwise()`: functions will compute results for each row
* `sum(!is.na())`: # of non-NAs in the data
* `first()`: first value in the data
* `last()`: last value in the data
* `range()`: minimum and maximum of the data
* `IQR()`: interquartile range of the data
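
The functions listed above can be tried on a small example. The vector `x` and the two-column tibble below are made up for this sketch.

```r
library(dplyr)

x <- c(4, NA, 10, 7)   # hypothetical values with one missing entry

n_present <- sum(!is.na(x))   # 3 non-missing values
first(x)                      # 4
last(x)                       # 7
range(x, na.rm = TRUE)        # 4 10
IQR(x, na.rm = TRUE)          # 3

# rowwise(): max() is computed within each row, not over whole columns
row_max <- tibble(a = c(1, 5), b = c(3, 2)) %>%
  rowwise() %>%
  mutate(biggest = max(a, b)) %>%
  ungroup()
```

Without `rowwise()`, `max(a, b)` would return the single largest value across both columns and repeat it in every row.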

## Summary & Lab Part 2

- `count(x)`: what unique values do you have?
- `distinct()`: what are the distinct values?
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes all subsequent functions
- `group_by()`: changes subsequent functions (remove with `ungroup()`)
- combine with `summarize()` to get statistics per group
- combine with `mutate()` to add column
- `summarize()` with `n()` gives the count (NAs included)
@@ -540,7 +528,7 @@ yts %>%

💻 [Lab](https://jhudatascience.org/intro_to_r/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
```{r, fig.alt="The End", out.width = "20%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

@@ -592,3 +580,23 @@ tb %>%
tb %>%
  summarize(across(starts_with("year"), ~mean(.x, na.rm = TRUE)))
```
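
The `across()` pattern also works on data that needs no download. This sketch applies it to two columns of the built-in `mtcars` data, selected by name instead of with `starts_with()`:

```r
library(dplyr)

col_means <- mtcars %>%
  summarize(across(c(mpg, wt), ~ mean(.x, na.rm = TRUE)))

col_means
```

Each selected column keeps its own name in the output and holds the mean of that column.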

## * New! * Many dplyr functions now have a `.by=` argument

Pipe `yts` into `group_by`, then pipe that into `summarize`:

```{r eval = FALSE}
yts %>%
  group_by(Response) %>%
  summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
            max_percent = max(Data_Value, na.rm = TRUE))
```

is the same as..

```{r eval = FALSE}
yts %>%
  summarize(avg_percent = mean(Data_Value, na.rm = TRUE),
            max_percent = max(Data_Value, na.rm = TRUE),
            .by = Response)
```
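
A runnable sketch of the same comparison on the built-in `mtcars` data, assuming dplyr 1.1.0 or later (the release that introduced `.by`). The names `grouped` and `with_by` are chosen only for this example.

```r
library(dplyr)  # .by requires dplyr >= 1.1.0

grouped <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

with_by <- mtcars %>%
  summarize(avg_mpg = mean(mpg), .by = cyl)

# group_by() sorts the groups, while .by keeps them in order of first
# appearance, so sort before comparing the two results
all.equal(as.data.frame(grouped),
          as.data.frame(arrange(with_by, cyl)))
```

The summary values match; only the row order can differ between the two approaches.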
12 changes: 6 additions & 6 deletions modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd
@@ -26,7 +26,7 @@ bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.cs

### 1.1

How many bike "lanes" are currently in Baltimore? You can assume each observation/row is a different bike "lane". (hint: how do you get the number of rows of a data set? You can use `dim()` or `nrow()` or another function).
How many streets with designated bike lanes are currently in Baltimore? You can assume each observation/row is a different street with one or more bike lanes. (Hint: how do you get the number of rows of a data set? You can use `dim()` or `nrow()` or another function).
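
As a reminder of how the functions in the hint behave, here is a sketch on the built-in `mtcars` data. It deliberately avoids the `bike` data, so it does not answer the question.

```r
n_rows <- nrow(mtcars)  # number of rows only
dims   <- dim(mtcars)   # number of rows, then number of columns

n_rows
dims
```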

```{r 1.1response}

@@ -47,7 +47,7 @@ Summarize the data to get the `max` of `length` using the `summarize` function.
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN))
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
```

```{r 1.3response}
@@ -61,8 +61,8 @@ Modify your code from 1.3 to add the `min` of `length` using the `summarize` fun
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN)
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
)
```

@@ -80,8 +80,8 @@ Summarize the `bike` data to get the mean of `length` and `dateInstalled`. Make
```
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE)
)
```
