transforming.qmd

# Transforming data

Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.

You are soon going to discover that long before you can visualize data, **you need to have it in a form that the visualization library can deal with**. One of the ways that isn't immediately obvious is **how your data is cast**. Most of the data you will encounter will be **wide -- each row will represent a single entity with multiple measures for that entity**. So think of states. Your row of your dataset could have the state name, population, average life expectancy and other demographic data.

But what if your visualization library needs one row for each measure? So state, data type and the data. Nebraska, Population, 1,929,000. That's one row. Then the next row is Nebraska, Average Life Expectancy, 76. That's the next row. That's where recasting your data comes in.

We can use a library called `tidyr` to `pivot_longer` or `pivot_wider` the data, depending on what we need. We'll use a dataset of college football attendance to demonstrate.

```{r}
#| warning: false
#| message: false
#| results: asis
#| echo: false
library(downloadthis)
library(glue)

dllink <- download_link(
  link = "http://mattwaite.github.io/sportsdatafiles/attendance.csv",
  button_label = "Download csv file",
  button_type = "danger",
  has_icon = TRUE,
  icon = "fa fa-save",
  self_contained = FALSE
)

glue("<pre><p><strong>For this walkthrough:</strong></p><p>{dllink}</p></pre>")

```

First we need some libraries.

```{r}
#| warning: false
#| message: false
library(tidyverse)
```

Now we'll load the data.

```{r}
attendance <- read_csv('data/attendance.csv')
```

```{r}
attendance
```

So as you can see, each row represents a school, and then each column represents a year. This is great for calculating the percent change -- we can subtract a column from a column and divide by that column. But later, when we want to chart each school's attendance over the years, we have to have each row be one team for one year. Nebraska in 2013, then Nebraska in 2014, and Nebraska in 2015 and so on.

To do that, we use `pivot_longer` because we're making wide data long. Since all of the columns we want to make rows start with 20, we can use that in our `cols` directive. Then we give that column a name -- Year -- and the values for each year need a name too. Those are the attendance figure. We can see right away how this works.

```{r}
attendance |> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance")
```

We've gone from 150 rows to 900, but that's expected when we have 6 years for each team.

## Making long data wide

We can reverse this process using `pivot_wider`, which makes long data wide.

Why do any of this?

In some cases, you're going to be given long data and you need to calculate some metric using two of the years -- a percent change for instance. So you'll need to make the data wide to do that. You might then have to re-lengthen the data now with the percent change. Some project require you to do all kinds of flexing like this. It just depends on the data.

So let's take what we made above and turn it back into wide data.

```{r}
longdata <- attendance |> pivot_longer(cols = starts_with("20"), names_to = "Year", values_to = "Attendance")

longdata
```

To `pivot_wider`, we just need to say where our column names are coming from -- the Year -- and where the data under it should come from -- Attendance.

```{r}
longdata |> pivot_wider(names_from = Year, values_from = Attendance)
```

And just like that, we're back.

## Why this matters

This matters because certain visualization types need wide or long data. A significant hurdle you will face for the rest of the semester is getting the data in the right format for what you want to do.

So let me walk you through an example using this data.

Let's look at Nebraska's attendance over the time period. In order to do that, I need long data because that's what the charting library, `ggplot2`, needs. You're going to learn a lot more about ggplot later.

```{r}
nebraska <- longdata |> filter(Institution == "Nebraska")
```

Now that we have long data for just Nebraska, we can chart it.

```{r}
ggplot(nebraska, aes(x=Year, y=Attendance, group=1)) + 
  geom_line() + 
  scale_y_continuous(labels = scales::comma) + 
  labs(x="Year", y="Attendance", title="We'll all stick together?", subtitle="It's not as bad as you think -- they widened the seats, cutting the number.", caption="Source: NCAA | By Matt Waite", color = "Outcome") +
  theme_minimal() + 
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 10),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 7),
    axis.ticks = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    legend.position="bottom"
  )
```