```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch02/")
```
# Getting Started {#sec-getting_started}
```{r include=FALSE}
library(ggplot2)
library(tidyverse)
library(broom)
ggplot2::theme_set(theme_bw(base_size = 16))
```
## Why plot your data? {#sec-why_plot}
> Getting information from a table is like extracting sunlight from a cucumber. @FarquharFarquhar:91
At the time the Farquhar brothers wrote this pithy aphorism, graphical methods for understanding data had advanced considerably but were not universally practiced, prompting their complaint.
The main graphic forms we use today---the pie chart, the line graph, and the bar chart---were invented by William Playfair around 1800 [@Playfair:1786; @Playfair:1801]. The scatterplot arrived shortly after [@Herschel:1833], and thematic maps showing
the spatial distributions of social variables (crime, suicides, literacy)
were used for the first time to reason about important societal questions
[@Guerry:1833], such as "is increased education associated with lower rates of crime?"
\ix{pie chart}
\ix{bar chart}
\ix{line graph}
\ix{scatterplot}
In the last half of the 19th century, the idea of correlation was
developed [@Galton:1886; @Pearson:1896], and the period, roughly 1860--1890, dubbed
the "Golden Age of Graphics" [@Funkhouser:1937],
became the richest period of innovation and beauty in the entire
history of data visualization. During this time there was an incredible development
of visual thinking, represented by the work of Charles Joseph Minard;
advances in the role of visualization within scientific discovery, as illustrated
by Francis Galton; and graphical excellence, embodied in the state statistical
atlases produced in France and elsewhere. See @Friendly:2008:golden and @FriendlyWainer:2021:TOGS for this history.
This chapter introduces the importance of graphing data through three nearly classic stories with the following themes:
* **summary statistics are not enough**: Anscombe's Quartet demonstrates datasets that are indistinguishable
by numerical summary statistics (mean, standard deviation, correlation), but whose relationships are
vastly different.
* **one lousy point can ruin your day**: A researcher is mystified by the difference between the correlations
for men and women until she plots the data.
* **finding the signal in the noise**: The story of the 1970 US Draft Lottery shows how a weak but
reliable signal, reflecting bias in a process, can be revealed by graphical enhancement
and summarization.
<!-- was: "child/02-anscombe.qmd" -->
### Anscombe's Quartet {#sec-anscombe}
In 1973, Francis Anscombe [@Anscombe:73] famously constructed a set of four datasets
to illustrate the importance of plotting data before analyzing it and building models, and the effect of unusual observations on fitted models.
Now known as _Anscombe's Quartet_, these datasets have identical statistical properties: the same
means, standard deviations, correlations and regression lines.
His purpose was to debunk three notions that had been prevalent at the time:
* Numerical calculations are exact, but graphs are coarse and limited by perception and resolution;
* For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
* Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
The dataset `datasets::anscombe` has 11 observations, recorded in wide format,
with variables `x1:x4` and `y1:y4`.
```{r anscombe-data}
data(anscombe)
head(anscombe)
```
The following code transforms this data to long format and calculates some summary statistics
for each `dataset`.
```{r anscombe-long}
anscombe_long <- anscombe |>
  pivot_longer(everything(),
               names_to = c(".value", "dataset"),
               names_pattern = "(.)(.)") |>
  arrange(dataset)

anscombe_long |>
  group_by(dataset) |>
  summarise(xbar = mean(x),
            ybar = mean(y),
            r = cor(x, y),
            intercept = coef(lm(y ~ x))[1],
            slope = coef(lm(y ~ x))[2])
```
As we can see, all four datasets have nearly identical univariate and bivariate statistical
measures. You can only see how they differ in graphs, which show their true natures to be vastly
different.
@fig-ch02-anscombe1 is an enhanced version of Anscombe's plot of these data, adding
helpful annotations to show visually the underlying statistical summaries.
```{r}
#| label: fig-ch02-anscombe1
#| echo: false
#| fig-align: center
#| out-width: "90%"
#| fig-cap: "Scatterplots of Anscombe's Quartet. Each plot shows the fitted regression line
#| and a 68% data ellipse representing the correlation between $x$ and $y$. "
knitr::include_graphics("figs/ch02/ch02-anscombe1.png")
```
This figure is produced as follows, using a single call to `ggplot()`, faceted by `dataset`.
As we will see later (@sec-data-ellipse), the data ellipse (produced by `stat_ellipse()`)
reflects the correlation between the variables.
```{r}
#| eval: false
desc <- tibble(
  dataset = as.character(1:4),   # match the character `dataset` values
  label = c("Pure error", "Lack of fit", "Outlier", "Influence")
)

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "red", linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  stat_ellipse(level = 0.68, color = "blue", type = "norm") +
  geom_label(data = desc, aes(label = label), x = 6, y = 12) +
  facet_wrap(~dataset, labeller = label_both)
```
The subplots are labeled with the statistical idea they reflect:
* dataset 1: **Pure error**. This is the typical case with well-behaved data. Variation of the points around the line reflects only measurement error or unreliability in the response, $y$.
* dataset 2: **Lack of fit**. The data are clearly curvilinear and would be very well described by a quadratic, `y ~ poly(x, 2)`, as the sketch following this list confirms. This violates the assumption of linear regression that the fitted model has the correct form.
* dataset 3: **Outlier**. One point, second from the right, has a very large residual. Because this point is near the extreme of $x$, it pulls the regression line towards it, as you can see by imagining a line through the remaining points.
* dataset 4: **Influence**. All but one of the points have the same $x$ value. The one unusual point has sufficient influence to force the regression line to fit it **exactly**.
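As a quick check (a sketch not in the original text), we can compare the linear and quadratic fits for dataset 2; the quadratic resolves the lack of fit almost perfectly:

```{r anscombe-quadratic}
# compare linear and quadratic fits for dataset 2
d2 <- filter(anscombe_long, dataset == "2")
summary(lm(y ~ x, data = d2))$r.squared           # linear: about 0.67
summary(lm(y ~ poly(x, 2), data = d2))$r.squared  # quadratic: essentially 1
```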
One moral from this example:
> **Linear regression only "sees" a line. It does its best when the data are really linear. Because the line is fit by least squares, it is pulled toward discrepant points to minimize the sum of squared residuals.**
::: {.callout-note title="Datasaurus Dozen"}
The method Anscombe used to compose his quartet is unknown, but it turns out that there is
a method to construct a wider collection of datasets with identical statistical properties.
After all, in a bivariate dataset with $n$ observations, the correlation
has $(n-2)$ degrees of freedom, so it is possible to choose $n-2$ of the $(x, y)$
pairs to yield any given value.

As it happens, it is also possible to create any number of datasets with the same means,
standard deviations and correlations in nearly any shape you like --- even a dinosaur!
The _Datasaurus Dozen_ was first publicized by Alberto Cairo in a [blog post](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)
and is available in the `r package("datasauRus", cite=TRUE)`.
As shown in @fig-datasaurus, the sets include a star, a cross, a circle, a bullseye,
horizontal and vertical lines, and, of course, the "dino".

The method [@MatejkaFitzmaurice2017] uses _simulated annealing_, an iterative process that
perturbs the points in a scatterplot, moving them toward a given shape while keeping the statistical summaries close to the fixed target values.
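You can verify this directly. Assuming the package's `datasaurus_dozen` dataset (in long format, with `dataset`, `x` and `y` columns), a quick check shows that the summaries barely move across the wildly different shapes:

```{r datasaurus-stats}
library(datasauRus)
datasaurus_dozen |>
  group_by(dataset) |>
  summarise(xbar = mean(x),
            ybar = mean(y),
            r = cor(x, y))
```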
The `r pkg("datasauRus")` package just contains the datasets, but
a general method, called _statistical metamers_,
for producing such datasets has been described by [Elio Campitelli](https://eliocamp.github.io/codigo-r/en/2019/01/statistical-metamerism/)
and implemented in the `r pkg("metamer")` package.
:::
::: {.content-visible when-format="html"}
```{r}
#| label: fig-datasaurus
#| out-width: "90%"
#| echo: false
#| fig-cap: "Animation of the Dinosaur Dozen datasets. Source: [https://youtu.be/It4UA75z_KQ](https://youtu.be/It4UA75z_KQ)"
knitr::include_graphics("images/DataSaurusDozen.gif")
```
:::
::: {.content-visible when-format="pdf"}
```{r}
#| label: fig-datasaurus2
#| out-width: "100%"
#| echo: false
#| fig-cap: "Plots of the Dinosaur Dozen datasets. Source: [Selçuk Korkmaz on X](https://x.com/selcukorkmaz/status/1864583253253927156)"
knitr::include_graphics("images/datasaurus-dozen.jpg")
```
:::
::: {.callout-note title="Quartets"}
The essential idea of a statistical "quartet" is to illustrate four quite different datasets
or circumstances that seem superficially the same, yet are paradoxically very different
when you look behind the scenes.

For example, in the context of causal analysis, @Gelman-etal:2023 illustrated
sets of four graphs, within each of which
all four panels represent the same average (latent) causal effect but with
much different patterns of individual effects; @McGowan2023 provide another illustration
with four seemingly identical datasets, each generated by a different causal mechanism.
As an example with machine learning models, @Biecek-etal:2023 introduced the "Rashomon Quartet",
a synthetic dataset for which four models from different classes
(linear model, regression tree, random forest, neural network)
have practically identical predictive performance.
In all cases, the paradox is resolved when
visualization reveals the distinct ways
of understanding structure in the data.
The [**quartets**](https://r-causal.github.io/quartets/) package contains these and other variations on this theme.
:::
<!-- was: "child/02-davis.qmd" -->
### One lousy point can ruin your day {#sec-davis}
In the mid-1980s, a consulting client had a strange problem.[^davis]
She was conducting a study of the relation between body image and weight preoccupation in
exercising and non-exercising people [@Davis:1990]. As part of the design, the researcher
wanted to know whether self-reported weight could be taken as a reliable indicator of
true weight measured on a scale. It was expected that the correlations between reported
and measured weight would be close to 1.0, and that the slopes of the regression lines
for men and women would also be close to 1.0. The dataset is `car::Davis`.
[^davis]: This story is told apocryphally. The consulting client actually did plot
the data, but needed help making better graphs.
She was therefore very surprised to see the following numerical results: for men,
the correlation was nearly perfect, but not so for women.
<!-- figure-code: R/Davis-reg.R -->
```{r davis-cor}
data(Davis, package="carData")
Davis <- Davis |>
drop_na() # drop missing cases
Davis |>
group_by(sex) |>
select(sex, weight, repwt) |>
summarise(r = cor(weight, repwt))
```
Similarly, the regression lines showed the expected slope for men, but that for women was only 0.26.
```{r davis-lm}
Davis |>
  nest(data = -sex) |>
  mutate(model = map(data, ~ lm(repwt ~ weight, data = .)),
         tidied = map(model, tidy)) |>
  unnest(tidied) |>
  filter(term == "weight") |>
  select(sex, term, estimate, std.error)
```
"What could be wrong here?", the client asked. The consultant replied with the obvious question:
> _Did you plot your data?_
The answer turned out to be one discrepant point, a female, whose measured weight was 166 kg
(366 lbs!). This single point exerted so much influence that it pulled the fitted regression line
down to a slope of only 0.26.
```{r}
#| label: fig-ch02-davis-reg1
#| fig-cap: "Regression for Davis' data on reported weight and measures weight for men and women.
#| Separate regression lines, predicting reported weight from measured weight are shown for males and females.
#| One highly unusual point is highlighted."
# shorthand to position legend inside the figure
legend_inside <- function(position) {
theme(legend.position = "inside",
legend.position.inside = position)
}
Davis |>
ggplot(aes(x = weight, y = repwt,
color = sex, shape = sex, linetype = sex)) +
geom_point(size = ifelse(Davis$weight==166, 6, 2)) +
geom_smooth(method = "lm", formula = y~x, se = FALSE) +
labs(x = "Measured weight (kg)", y = "Reported weight (kg)") +
scale_linetype_manual(values = c(F = "longdash", M = "solid")) +
legend_inside(c(.8, .8))
```
In this example, one could argue that the $x$ and $y$ axes should be reversed,
to determine how well measured weight can be predicted from reported weight.
In `ggplot` this is easily done by reversing the `x` and `y` aesthetics.
```{r}
#| label: fig-ch02-davis-reg2
#| fig-cap: "Regression for Davis' data on reported weight and measures weight for men and women.
#| Separate regression lines, predicting measured weight from reported weight are shown for males and females.
#| The highly unusual point no longer has an effect on the fitted lines."
Davis |>
ggplot(aes(y = weight, x = repwt, color = sex, shape=sex)) +
geom_point(size = ifelse(Davis$weight==166, 6, 2)) +
labs(y = "Measured weight (kg)", x = "Reported weight (kg)") +
geom_smooth(method = "lm", formula = y~x, se = FALSE) +
legend_inside(c(.8, .8))
```
In @fig-ch02-davis-reg2, this discrepant observation again stands out like a sore thumb, but
it makes very little difference in the fitted line for females.
The reason is that this point is well within the range of the $x$ variable (`repwt`).
To impact the slope of the regression line, an observation must be unusual in _both_ $x$ and $y$.
We take up the topic of how to detect influential observations and what to do about them in
@sec-linear-models-plots.
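To see this numerically (a quick check, not part of the original analysis), refit the regression for females with and without the discrepant case; the slope changes very little because the point has low leverage on `repwt`:

```{r davis-refit}
# slope for females, with and without the discrepant case
lm(weight ~ repwt, data = filter(Davis, sex == "F")) |> coef()
lm(weight ~ repwt, data = filter(Davis, sex == "F", weight != 166)) |> coef()
```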
The value of such plots is not only that they can reveal possible problems with an analysis,
but also that they help identify the reasons and suggest corrective action.
What went wrong here? Examination of the original data showed that this person had switched
the values, recording her reported weight in the box for measured weight and vice versa.
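A one-line check (assuming the `carData` version of the dataset) locates the offending case for inspection:

```{r davis-locate}
# the single case responsible for the anomalous female results
Davis |>
  filter(weight == 166)
```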
```{r child="child/02-draft1970.qmd"}
```
## Plots for data analysis
Visualization methods take an enormous variety of forms, but it is useful to distinguish several broad categories according to their use in data analysis:
* __data plots__ : primarily plot the raw data, often with annotations to aid interpretation (regression lines and smooths, data ellipses, marginal distributions)
* __reconnaissance plots__ : with more than a few variables, reconnaissance plots provide a high-level,
bird's-eye overview of the data, allowing you to see patterns that might not be visible in a set of separate plots. Some examples are scatterplot matrices, showing all bivariate plots of the variables
in a dataset; correlation diagrams, using visual glyphs to represent the correlations between
all pairs of variables; and "trellis" or faceted plots, which show how a focal relation among
one or more variables differs across values of other variables.
* __model plots__ : plot the results of a fitted model, such as a regression line or curve
with a band to show uncertainty, a regression surface in 3D, or the coefficients of a model
together with their confidence intervals.
Other model plots try to take into account that
a fitted model may involve more variables than can be shown in a static 2D plot.
Some examples are added-variable plots and marginal effect plots, both of which
attempt to show the net relation of two focal variables, controlling or adjusting for the other variables
in a model.
* __diagnostic plots__ : indicating potential problems with the fitted model. These include residual plots, influence plots, plots for testing homogeneity of variance and so forth.
* __dimension reduction plots__ : plot representations of the data into a space of fewer dimensions than the number of variables in the dataset. Simple examples include principal components analysis (PCA) and the related biplots, and multidimensional scaling (MDS) methods.
We give some more details and a few examples in the sections that follow.
## Data plots
Data plots portray the data in a space where the coordinate axes are the observed variables.
* 1D plots include line plots, histograms and density estimates.
* 2D plots are most often scatterplots, but contour plots or hex-binned plots are also useful when the sample size is large (see the sketch below).
<!-- * For higher dimensions, biplots, showing the data in principal components space, together with vectors representing the correlations among variables, are often the most useful. -->
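As a quick illustration (a sketch with simulated data, not from the chapter), hex-binning keeps density visible where a plain scatterplot would overplot:

```{r hexbin-demo}
#| eval: false
# with 50,000 points, geom_hex() shows density that overplotting would hide
# (geom_hex() requires the hexbin package)
set.seed(42)
big <- tibble(x = rnorm(50000),
              y = x / 2 + rnorm(50000))
ggplot(big, aes(x, y)) +
  geom_hex(bins = 40)
```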
## Model plots
Model plots show the fitted or predicted values from a statistical model and provide visual summaries...
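A minimal example (a sketch reusing the Davis data from above): plotting the fitted lines with their confidence bands summarizes both the model and its uncertainty:

```{r model-plot-demo}
#| eval: false
ggplot(Davis, aes(x = weight, y = repwt, color = sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Measured weight (kg)", y = "Reported weight (kg)")
```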
## Diagnostic plots
## Principles of graphical display
[This could be a separate chapter]
- Criteria for assessing graphs: communication goals
- Effective data display:
+ Make the data stand out
+ Make graphical comparison easy
+ Effect ordering: For variables and unordered factors, arrange them according to the
effects to be seen (sketched after this list)
- Visual thinning: As the data becomes more complex, focus more on impactful summaries
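For instance, effect ordering (sketched here with the `mpg` data bundled with ggplot2) arranges factor levels by the statistic to be compared rather than alphabetically:

```{r effect-order-demo}
#| eval: false
# order vehicle classes by median highway mileage, not alphabetically
ggplot(mpg, aes(x = reorder(class, hwy, median), y = hwy)) +
  geom_boxplot() +
  labs(x = "Vehicle class (ordered by median hwy)", y = "Highway MPG")
```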
**Package summary**
```{r}
#| echo: false
#| results: asis
#cat("Packages used here:\n")
write_pkgs(file = .pkg_file)
```
<!-- ## References {.unnumbered} -->