---
output:
  bookdown::html_document2:
    fig_caption: yes
editor_options:
  chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
# This block needs cache=FALSE to set fig.width and fig.height, and have those
# persist across cached builds.
source("utils.R", local = TRUE)
knitr::opts_chunk$set(fig.width=5, fig.height=4, out.width="50%")
```
Quickly Exploring Data {#CHAPTER-QUICK}
======================
Although I've used the ggplot2 package for most of the graphics in this book, it is not the only way to plot data. For very quick exploration of data, it's sometimes useful to use the plotting functions in base R. These are installed by default with R and do not require any additional packages to be installed. They're quick to type, straightforward to use in simple cases, and run very quickly.
If you want to do anything beyond very simple plots, though, it's generally better to switch to ggplot2. This is in part because ggplot2 provides a unified interface and set of options, instead of the grab bag of modifiers and special cases required in base graphics. Once you learn how ggplot2 works, you can use that knowledge for everything from scatter plots and histograms to violin plots and maps.
Each recipe in this section shows how to make a graph with base graphics. Each recipe also shows how to make a similar graph with the `ggplot()` function in ggplot2. The previous edition of this book also gave examples using the `qplot()` function from the ggplot2 package, but `ggplot()` is now the recommended approach.
If you already know how to use R's base graphics, having these examples side by side will help you transition to ggplot2 when you want to make more sophisticated graphics.
Creating a Scatter Plot {#RECIPE-QUICK-SCATTER}
-----------------------
### Problem
You want to create a scatter plot.
### Solution
To make a scatter plot (Figure \@ref(fig:FIG-QUICK-SCATTER-BASE)), use `plot()` and pass it a vector of *x* values followed by a vector of *y* values:
```{r FIG-QUICK-SCATTER-BASE, small.mar=TRUE, fig.cap='Scatter plot with base graphics', out.width="70%"}
plot(mtcars$wt, mtcars$mpg)
```
The expression `mtcars$wt` returns the column named `wt` from the `mtcars` data frame, and `mtcars$mpg` returns the `mpg` column.
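As a quick check of what `plot()` receives here, the sketch below pulls the two columns out of `mtcars` and inspects them before plotting. The variable names `x_vals` and `y_vals` are only illustrative and do not appear elsewhere in this chapter.

```r
# Extract the two columns as plain numeric vectors
x_vals <- mtcars$wt
y_vals <- mtcars$mpg

# Both vectors must have the same length, one element per car
str(x_vals)
length(x_vals) == length(y_vals)

# Same scatter plot as above, built from the standalone vectors
plot(x_vals, y_vals)
```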
With ggplot2, you can get a similar result using the `ggplot()` function (Figure \@ref(fig:FIG-QUICK-SCATTER-GGPLOT)):
```{r FIG-QUICK-SCATTER-GGPLOT, fig.cap='Scatter plot with ggplot2', out.width="70%"}
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```
The first part, `ggplot()`, tells it to create a plot object, and the second part, `geom_point()`, tells it to add a layer of points to the plot.
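To make the idea of a plot object and its layers more concrete, here is a minimal sketch that builds the plot in two steps. The names `p_base` and `p_points` are hypothetical; nothing is drawn until an object is printed.

```r
library(ggplot2)

# Step 1: create the plot object with data and aesthetic mappings, but no layers
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer of points; this returns a new plot object
p_points <- p_base + geom_point()

# Printing a plot object is what actually draws it
print(p_points)
```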
The usual way to use `ggplot()` is to pass it a data frame (`mtcars`) and then tell it which columns to use for the x and y values. If you want to pass it two vectors for x and y values, you can use `data = NULL`, and then pass it the vectors. Keep in mind that ggplot2 is designed to work with data frames as the data source, not individual vectors, and that using it this way will only allow you to use a limited part of its capabilities.
```{r, eval=FALSE}
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) +
  geom_point()
```
`ggplot()` commands are commonly spread across multiple lines, as in the examples above, with each line ending in a `+` so that R knows the command continues on the next line.
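A small sketch of the two layouts, assuming the `mtcars` scatter plot from above: both forms build the same plot, and the only requirement for the multi-line form is that each line ends with `+`. The names `p_single` and `p_multi` are illustrative.

```r
library(ggplot2)

# Everything on one line
p_single <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Split across lines; the trailing + tells R the expression continues
p_multi <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Both objects produce the same computed point data
all.equal(layer_data(p_single), layer_data(p_multi))
```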
### See Also
See Chapter \@ref(CHAPTER-SCATTER) for more in-depth information about creating scatter plots.
Creating a Line Graph {#RECIPE-QUICK-LINE}
---------------------
### Problem
You want to create a line graph.
### Solution
To make a line graph using `plot()` (Figure \@ref(fig:FIG-QUICK-LINE-BASE), left), pass it a vector of x values and a vector of y values, and use `type = "l"`:
```{r eval=FALSE}
plot(pressure$temperature, pressure$pressure, type = "l")
```
```{r FIG-QUICK-LINE-BASE, echo=FALSE, small.mar=TRUE, fig.show="hold", fig.cap="Line graph with base graphics (left); With points and another line (right)"}
plot(pressure$temperature, pressure$pressure, type = "l")
# This code should be the same as the following block, but there's a problem:
# knitr won't show the output of the previous plot() if this line is exactly
# like the one above. So in this case we'll plot points and then lines. The
# output looks the same and knitr won't suppress the output.
plot(pressure$temperature, pressure$pressure, type = "p")
lines(pressure$temperature, pressure$pressure)
lines(pressure$temperature, pressure$pressure/2, col = "red")
points(pressure$temperature, pressure$pressure/2, col = "red")
```
To add points and/or multiple lines (Figure \@ref(fig:FIG-QUICK-LINE-BASE), right), first call `plot()` for the first line, then add points with `points()` and additional lines with `lines()`:
```{r eval=FALSE}
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure)
lines(pressure$temperature, pressure$pressure/2, col = "red")
points(pressure$temperature, pressure$pressure/2, col = "red")
```
With ggplot2, you can get a similar result using `geom_line()` (Figure \@ref(fig:FIG-QUICK-LINE-GGPLOT)):
```{r eval=FALSE}
library(ggplot2)
ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line()
```
(ref:cap-FIG-QUICK-LINE-GGPLOT) Line graph with `ggplot()` (left); With points added (right)
```{r FIG-QUICK-LINE-GGPLOT, echo=FALSE, fig.show="hold", fig.cap="(ref:cap-FIG-QUICK-LINE-GGPLOT)"}
ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line()
# With points added:
ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point()
```
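As with scatter plots, you can pass your data as vectors instead of in a data frame, but this will limit what you can do later with the plot. To add points to the line graph (Figure \@ref(fig:FIG-QUICK-LINE-GGPLOT), right), add a `geom_point()` layer: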
```{r eval=FALSE}
ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point()
```
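For completeness, this is roughly what the vector-based form looks like for a line graph, following the `data = NULL` pattern from the scatter plot recipe. It is a sketch rather than a recommended approach, and the name `p_vec` is illustrative.

```r
library(ggplot2)

# Pass the vectors directly instead of naming columns of a data frame
p_vec <- ggplot(data = NULL, aes(x = pressure$temperature, y = pressure$pressure)) +
  geom_line() +
  geom_point()

p_vec
```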
> **Note**
>
> It's common with `ggplot()` to split the command on multiple lines, ending each line with a `+` so that R knows that the command will continue on the next line.
### See Also
See Chapter \@ref(CHAPTER-LINE-GRAPH) for more in-depth information about creating line graphs.
Creating a Bar Graph {#RECIPE-QUICK-BAR}
--------------------
### Problem
You want to make a bar graph.
### Solution
To make a bar graph of values (Figure \@ref(fig:FIG-QUICK-BAR-BASE), left), use `barplot()` and pass it a vector of values for the height of each bar and (optionally) a vector of labels for each bar. If the vector has names for the elements, the names will automatically be used as labels:
```{r}
# First, take a look at the BOD data
BOD
```
```{r eval=FALSE}
barplot(BOD$demand, names.arg = BOD$Time)
```
```{r FIG-QUICK-BAR-BASE, echo=FALSE, small.mar=TRUE, fig.show="hold", fig.cap="Bar graph of values with base graphics (left); Bar graph of counts (right)"}
barplot(BOD$demand, names.arg = BOD$Time)
barplot(table(mtcars$cyl))
```
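The point about element names can be sketched as follows: if the heights are stored in a named vector, `barplot()` labels the bars without needing `names.arg`. The variable `demand_named` is an illustrative name.

```r
# Attach the Time values as names on the vector of heights
demand_named <- setNames(BOD$demand, BOD$Time)
demand_named

# The names are picked up as bar labels automatically
barplot(demand_named)
```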
Sometimes "bar graph" refers to a graph where the bars represent the *count* of cases in each category. This is similar to a histogram, but with a discrete instead of continuous x-axis. To generate the count of each unique value in a vector, use the `table()` function:
```{r}
# There are 11 cases of the value 4, 7 cases of 6, and 14 cases of 8
table(mtcars$cyl)
```
Then pass the table to `barplot()` to generate the graph of counts:
```{r eval=FALSE}
# Generate a table of counts
barplot(table(mtcars$cyl))
```
With ggplot2, you can get a similar result (Figure \@ref(fig:FIG-QUICK-BAR-GGPLOT)). To plot a bar graph of *values*, use `geom_col()`. Notice the difference in the output when the *x* variable is continuous and when it is discrete:
(ref:cap-FIG-QUICK-BAR-GGPLOT) Bar graph of values using `geom_col()` with a continuous x variable (left); With x variable converted to a factor (notice that there is no entry for 6; right)
```{r FIG-QUICK-BAR-GGPLOT, fig.show="hold", fig.cap="(ref:cap-FIG-QUICK-BAR-GGPLOT)"}
library(ggplot2)
# Bar graph of values. This uses the BOD data frame, with the
# "Time" column for x values and the "demand" column for y values.
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_col()
# Convert the x variable to a factor, so that it is treated as discrete
ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()
```
ggplot2 can also be used to plot the *count* of data rows in each category (Figure \@ref(fig:FIG-QUICK-BAR-GGPLOT-COUNT)), by using `geom_bar()` instead of `geom_col()`. Once again, notice the difference between a continuous x-axis and a discrete one. For some kinds of data, it may make more sense to convert the continuous x variable to a discrete one, with the `factor()` function.
(ref:cap-FIG-QUICK-BAR-GGPLOT-COUNT) Bar graph of counts using `geom_bar()` with a continuous x variable (left); With x variable converted to a factor (right)
```{r FIG-QUICK-BAR-GGPLOT-COUNT, fig.show="hold", fig.cap="(ref:cap-FIG-QUICK-BAR-GGPLOT-COUNT)"}
# Bar graph of counts. This uses the mtcars data frame, with the "cyl" column for
# x position. The y position is calculated by counting the number of rows for
# each value of cyl.
ggplot(mtcars, aes(x = cyl)) +
  geom_bar()
# Bar graph of counts, with cyl converted to a factor so it is treated as discrete
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
```
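To connect the two approaches, the sketch below computes the counts up front with `table()` and plots them as values with `geom_col()`. The bar heights should match what `geom_bar()` computes from the raw rows. The data frame name `cyl_counts` is illustrative, and `Freq` is the column name that `as.data.frame()` gives to table counts.

```r
library(ggplot2)

# Count the rows for each value of cyl and convert the table to a data frame
cyl_counts <- as.data.frame(table(cyl = mtcars$cyl))
cyl_counts

# Plot the precomputed counts as bar heights
ggplot(cyl_counts, aes(x = cyl, y = Freq)) +
  geom_col()
```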
> **Note**
>
> In previous versions of ggplot2, the recommended way to create a bar graph of values was to use `geom_bar(stat = "identity")`. As of ggplot2 2.2.0, there is a `geom_col()` function which does the same thing.
### See Also
See Chapter \@ref(CHAPTER-BAR-GRAPH) for more in-depth information about creating bar graphs.
Creating a Histogram {#RECIPE-QUICK-HISTOGRAM}
--------------------
### Problem
You want to view the distribution of one-dimensional data with a histogram.
### Solution
To make a histogram (Figure \@ref(fig:FIG-QUICK-HIST-BASE)), use `hist()` and pass it a vector of values:
```{r FIG-QUICK-HIST-BASE, small.mar=TRUE, fig.show="hold", fig.cap="Histogram with base graphics (left); With more bins. Notice that because the bins are narrower, there are fewer items in each bin. (right)"}
hist(mtcars$mpg)
# Specify approximate number of bins with breaks
hist(mtcars$mpg, breaks = 10)
```
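Because a single number passed to `breaks` is only a suggestion, it can help to look at the bins that `hist()` actually chose. A minimal sketch, using `plot = FALSE` to get the computed histogram without drawing it (the name `h` is illustrative):

```r
# Compute the histogram without plotting it
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# Bin boundaries that were actually used
h$breaks

# Number of cars falling in each bin; these sum to the number of rows
h$counts
sum(h$counts) == nrow(mtcars)
```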
With ggplot2, you can get a similar result using `geom_histogram()` (Figure \@ref(fig:FIG-QUICK-HIST-GGPLOT)):
```{r FIG-QUICK-HIST-GGPLOT, fig.show="hold", fig.cap="ggplot2 histogram with default bin width (left); With wider bins (right)"}
library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
# With wider bins
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)
```
When you create a histogram without specifying the bin width, `ggplot()` prints out a message telling you that it's defaulting to 30 bins, and to pick a better bin width. This is because it's important to explore your data using different bin widths; the default of 30 may or may not show you something useful about your data.
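One way to follow that advice is to build the same histogram with several bin widths and compare them. The widths below are arbitrary choices for illustration, and `bin_widths` and `hist_plots` are hypothetical names.

```r
library(ggplot2)

# Arbitrary bin widths to compare
bin_widths <- c(1, 2, 5)

# Build one histogram per bin width, labelling each with its width
hist_plots <- lapply(bin_widths, function(bw) {
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = bw) +
    ggtitle(paste("binwidth =", bw))
})

# Print each plot in turn
for (p in hist_plots) print(p)
```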
### See Also
For more in-depth information about creating histograms, see
Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-HIST) and
Recipe \@ref(RECIPE-DISTRIBUTION-MULTI-HIST).
Creating a Box Plot {#RECIPE-QUICK-BOXPLOT}
-------------------
### Problem
You want to create a box plot for comparing distributions.
### Solution
To make a box plot (Figure \@ref(fig:FIG-QUICK-BOXPLOT-BASE)), use `plot()` and pass it a factor of x values and a vector of y values. When x is a factor (as opposed to a numeric vector), it will automatically create a box plot:
```{r eval=FALSE}
plot(ToothGrowth$supp, ToothGrowth$len)
```
```{r FIG-QUICK-BOXPLOT-BASE, echo=FALSE, small.mar=TRUE, fig.show="hold", fig.cap="Box plot with base graphics (left); With multiple grouping variables (right)", fig.width=10, out.width="100%"}
layout(t(c(1,2,2)))
plot(ToothGrowth$supp, ToothGrowth$len)
boxplot(len ~ supp + dose, data = ToothGrowth)
```
If the two vectors are in the same data frame, you can also use the `boxplot()` function with formula syntax. With this syntax, you can combine two variables on the x-axis, as in Figure \@ref(fig:FIG-QUICK-BOXPLOT-BASE), right:
```{r eval=FALSE}
# Formula syntax
boxplot(len ~ supp, data = ToothGrowth)
# Put interaction of two variables on x-axis
boxplot(len ~ supp + dose, data = ToothGrowth)
```
With the ggplot2 package, you can get a similar result (Figure \@ref(fig:FIG-QUICK-BOXPLOT-GGPLOT), left) with `geom_boxplot()`:
```{r eval=FALSE}
library(ggplot2)
ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()
```
(ref:cap-FIG-QUICK-BOXPLOT-GGPLOT) Box plot with `ggplot()` (left); With multiple grouping variables (right)
```{r FIG-QUICK-BOXPLOT-GGPLOT, echo=FALSE, fig.show="hold", fig.cap="(ref:cap-FIG-QUICK-BOXPLOT-GGPLOT)"}
ggplot(ToothGrowth, aes(supp, len)) +
  geom_boxplot()
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
  geom_boxplot()
```
It's also possible to make box plots for multiple variables by combining the variables with `interaction()`, as in Figure \@ref(fig:FIG-QUICK-BOXPLOT-GGPLOT), right:
```{r eval=FALSE}
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
  geom_boxplot()
```
> **Note**
>
> You may have noticed that the box plots from base graphics are ever-so-slightly different from those from ggplot2. This is because they use slightly different methods for calculating quantiles. See `?geom_boxplot` and `?boxplot.stats` for more information on how they differ.
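
The difference can be seen numerically. As described in `?boxplot.stats`, base graphics take the box edges from the hinges computed by `fivenum()`, whereas `?geom_boxplot` describes the box edges as the 25th and 75th percentiles. The sketch below compares the hinges with the default `quantile()` quartiles for one supplement group; `len_vc` is an illustrative name.

```r
# Tooth lengths for the VC supplement group
len_vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(len_vc)

# Lower hinge, median, and upper hinge as used by base boxplot()
boxplot.stats(len_vc)$stats[2:4]

# Quartiles from quantile() with its default method
quantile(len_vc, probs = c(0.25, 0.5, 0.75))
```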
### See Also
For more on making basic box plots, see Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-BOXPLOT).
Plotting a Function Curve {#RECIPE-QUICK-FUNCTION}
-------------------------
### Problem
You want to plot a function curve.
### Solution
To plot a function curve, as in Figure \@ref(fig:FIG-QUICK-FUNCTION-BASE), use `curve()` and pass it an expression with the variable x:
```{r eval=FALSE}
curve(x^3 - 5*x, from = -4, to = 4)
```
```{r FIG-QUICK-FUNCTION-BASE, echo=FALSE, small.mar=TRUE, fig.show="hold", fig.cap="Function curve with base graphics (left); With user-defined function (right)"}
curve(x^3 - 5*x, from = -4, to = 4)
# Plot a user-defined function
myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
# Add a line:
curve(1 - myfun(x), add = TRUE, col = "red")
```
You can plot any function that takes a numeric vector as input and returns a numeric vector, including functions that you define yourself. Using `add = TRUE` will add a curve to the previously created plot:
```{r eval=FALSE}
# Plot a user-defined function
myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
# Add a line:
curve(1 - myfun(x), add = TRUE, col = "red")
```
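The requirement that the function accept and return numeric vectors can be checked directly: passing a vector of inputs should give back a vector of the same length. A short sketch using `myfun()` (redefined here so the block stands alone; `test_x` is an illustrative name):

```r
myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}

# Three inputs in, three outputs out
test_x <- c(0, 10, 20)
myfun(test_x)

# At x = 10 the exponent is zero, so the value is 1 / (1 + 1)
myfun(10) == 0.5
```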
With ggplot2, you can get a similar result (Figure \@ref(fig:FIG-QUICK-FUNCTION-GGPLOT)), by using `stat_function(geom = "line")` and passing it a function that takes a numeric vector as input and returns a numeric vector:
```{r FIG-QUICK-FUNCTION-GGPLOT, fig.cap="A function curve with ggplot2", out.width="70%"}
library(ggplot2)
# This sets the x range from 0 to 20
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = myfun, geom = "line")
```
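An alternative sketch, assuming ggplot2 3.3.0 or later: `geom_function()` draws a function without a placeholder data frame, with the x range supplied by `xlim()`. The function is redefined here so the block runs on its own, and `p_fun` is an illustrative name.

```r
library(ggplot2)

myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}

# No data frame needed; xlim() sets the range over which the function is evaluated
p_fun <- ggplot() +
  xlim(0, 20) +
  geom_function(fun = myfun)

p_fun
```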
### See Also
See Recipe \@ref(RECIPE-MISCGRAPH-FUNCTION) for more in-depth information about plotting function curves.