forked from dyerlab/applied_population_genetics
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathr_containers.rmd
355 lines (237 loc) · 12.2 KB
/
r_containers.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
# Data Containers {.imageChapter}
<div class="chapter_image"><img src="chapter_images/ch_turtle.jpg"></div>
We almost never work with a single datum^[The word *data* is plural, datum is singular], rather we keep lots of data. Moreover, the kinds of data are often heterogeneous, including categorical (Populations, Regions), continuous (coordinates, rainfall, elevation), imagry (hyperspectral, LiDAR), and perhaps even genetic. R has a very rich set of containers into which we can stuff our data as we work with it. Here these container types are examined and the restrictions and benefits associated with each type are explained.
## Vectors
We have already seen several examples of several vectors in action (see the introduction to Numeric data types for example). A vector of objects is simply a collection of them, often created using the `c()` function (*c* for combine). Vectorized data is restricted to having homogeneous data types---you cannot mix character and numeric types in the same vector. If you try to mix types, R will either coerce your data into a reasonable type
```{r}
x <- c(1,2,3)
x
y <- c(TRUE,TRUE,FALSE)
y
z <- c("I","am","not","a","looser")
z
```
or coearce them into one type that is amenable to all the types of data that you have given it. In this example, a Logical, Character, Constant, and Function are combined resulting in a vector output of type Character.
```{r}
w <- c(TRUE, "1", pi, ls())
w
class(w)
```
Accessing elements within a vector are done using the square bracket `[]` notation. All indices (for vectors and matrices) start at 1 (not zero as is the case for some languages). Getting and setting the components within a vector are accomplished using numeric indices with the assignment operators just like we do for variables containing a single value.
```{r}
x
x[1] <- 2
x[3] <- 1
x
x[2]
```
A common type of vector is that of a sequences. We use sequences all the time, to iterate through a list, to counting generations, etc. There are a few ways to generate sequences, depending upon the step sequence. For a sequence of whole numbers, the easiest is through the use of the colon operator.
```{r}
x <- 1:6
x
```
This provides a nice shorthand for getting the values X:Y from X to Y, inclusive. It is also possible to go backwards using this operator, counting down from X to Y as in:
```{r}
x <- 5:2
x
```
The only constraint here is that we are limited to a step size of 1.0. It is possible to use non-integers as the bounds, it will just count up by 1.0 each time.
```{r}
x <- 3.2:8.4
x
```
If you are interested in making a sequence with a step other than 1.0, you can use the `seq()` function. If you do not provide a step value, it defaults to 1.0.
```{r}
y <- seq(1,6)
y
```
But if you do, it will use that instead.
```{r}
z <- seq(1,20,by=2)
z
```
It is also possible to create a vector of objects as repetitions using the `rep()` (for repeat) function.
```{r}
rep("Beetlejuice",3)
```
If you pass a vector of items to `rep()`, it can repeat these as either a vector being repeated (the default value)
```{r}
x <- c("No","Free","Lunch")
rep(x,time=3)
```
or as each item in the vector repeated.
```{r}
rep(x,each=3)
```
## Matrices
A matrix is a 2- or higher dimensional container, most commonly used to store numeric data types. There are some libraries that use matrices in more than two dimensions (rows and columns and sheets), though you will not run across them too often. Here I restrict myself to only 2-dimensional matrices.
You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want. The easiest matrix to try is one with empty values:
```{r}
matrix(nrow=2, ncol=2)
```
Perhaps more useful is one that is pre-populated with values.
```{r}
matrix(1:4, nrow=2 )
```
Notice that here, there were four entries and I only specified the number of rows required. By default the ‘filling-in' of the matrix will proceed down column (*by-column*). In this example, we have the first column with the first two entries and the last two entries down the second column. If you want it to fill by row, you can pass the optional argument
```{r}
matrix(1:4, nrow=2, byrow=TRUE)
```
and it will fill *by-row*.
When filling matrices, the default size and the size of the data being added to the matrix are critical. For example, I can create a matrix as:
```{r}
Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y
```
or
```{r}
X <- matrix(c(1,2,3,4,5,6),nrow=2)
X
```
and both produce a similar matrix, only transposed.
```{r}
X == t(Y)
```
In the example above, the number of rows (or columns) was a clean multiple of the number of entries. However, if it is not, R will fill in values.
```{r}
X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)
```
Notice how you get a warning from the interpreter. But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.
```{r}
X
```
The dimensionality of a matrix (and `data.frame` as we will see shortly) is returned by the `dim()` function. This will provide the number of rows and columns as a vector.
```{r}
dim(X)
```
Accessing elements to retrieve or set their values within a matrix is done using the square brackets just like for a vector but you need to give `[row,col]` indices. Again, these are 1-based so that
```{r}
X[1,3]
```
is the entry in the 1st row and 3rd column.
You can also use ‘slices' through a matrix to get the rows
```{r}
X[1,]
```
or columns
```{r}
X[,3]
```
of data. Here you just omit the index for the entity you want to span. Notice that when you grab a slice, even if it is a column, is given as a vector.
```{r}
length(X[,3])
```
You can grab a sub-matrix using slices if you give a range (or sequence) of indices.
```{r}
X[,2:3]
```
If you ask for values from a matrix that exceed its dimensions, R will give you an error.
```{r,eval=FALSE}
X[1,8]
```
```
## Error in X[1, 8] : subscript out of bounds
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```
There are a few cool extensions of the `rep()` function that can be used to create matrices as well. They are optional values that can be passed to the function.
* `times=x`: This is the default option that was occupied by the ‘3' in the example above and represents the number of times that first argument will be repeated.
* `each=x` This will take each element in the first argument are repeat them `each` times.
* `length.out=x`: This make the result equal in length to `x`.
In combination, these can be quite helpful. Here is an example using numeric sequences in which it is necessary to find the index of all entries in a 3x2 matrix. To make the indices, I bind two columns together using `cbind()`. There is a matching row binding function, denoted as `rbind()` (perhaps not so surprisingly). What is returned is a matrix
```{r}
indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6) )
indices
```
## Lists
A list is a type of vector but is indexed by ‘keys' rather than by numeric indices. Moreover, lists can contain heterogeneous types of data (e.g., values of different `class`), which is not possible in a vector type. For example, consider the list
```{r}
theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)
```
which is defined with a numeric, a character, and a logical component. Each of these entries can be different in length as well as type. Once defined, the entries may be observed as:
```{r}
theList
```
Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.
```{r}
theList$my_favoriate_number <- 2.9 + 3i
```
or use double brackets and the name of the variable as a character string.
```{r}
theList[["lotto numbers"]] <- rpois(7,lambda=42)
```
The keys currently in the list are given by the `names()` function
```{r}
names(theList)
```
Getting and setting values within a list are done the same way using either the `$`-operator
```{r}
theList$x
theList$x[2] <- 42
theList$x
```
or the double brackets
```{r}
theList[["x"]]
```
or using a numeric index, but that numeric index is looks to the results of `names()` to figure out which key to use.
```{r}
theList[[2]]
```
The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the `names()` function (*dog* in this case). If you want to access elements within that variable, then you add a second set of brackets on after the double ones.
```{r}
theList[[1]][3]
```
This deviates from the matrix approach as well as from how we access entries in a `data.frame` (described next). It is not a single square bracket with two indices, that gives you an error:
```{r eval=FALSE}
theList[1,3]
```
```
## Error in theList[1, 3] : incorrect number of dimensions
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```
List are rather robust objects that allow you to store a wide variety of data types (including nested lists). Once you get the indexing scheme down, it they will provide nice solutions for many of your computational needs.
## Data Frames
The `data.frame` is the default data container in R. It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, as well as a database. If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data. There may be `factors` representing designations such as species, regions, populations, sex, flower color, etc. Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content. You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.
On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, `NA`, in each row. Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined. A `data.frame` is similar to this, at least conceptually. You define a `data.frame` by designating the columns of data to be used. You do not need to define all of them, more can be added later. The values passed can be sequences, collections of values, or computed parameters. For example:
```{r}
df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10))
df
```
You can see that each column is a unified type of data and each row is equivalent to a record. Additional data columns may be added to an existing data.frame as:
```{r}
df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
```
Since we may have many (thousands?) of rows of observations, a `summary()` of the data.frame can provide a more compact description.
```{r}
summary(df)
```
We can add columns of data to the data.frame after the fact using the `$`-operator to indicate the column name. Depending upon the data type, the summary will provide an overview of what is there.
### Indexing Data Frames
You can access individual items within a `data.frame` by numeric index such as:
```{r}
df[1,3]
```
You can slide indices along rows (which return a new `data.frame` for you)
```{r}
df[1,]
```
or along columns (which give you a vector of data)
```{r}
df[,3]
```
or use the `$`-operator as you did for the list data type to get direct access to a either all the data or a specific subset therein.
```{r}
df$Names[3]
```
Indices are ordered just like for matrices, rows first then columns. You can also pass a set of indices such as:
```{r}
df[1:3,]
```
It is also possible to use logical operators as indices. Here I select only those names in the data.frame whose score was >90 and they passed popgen.
```{r}
df$Names[df$Score > 90 & df$Passed_Class==TRUE]
```
This is why `data.frame` objects are very database like. They can contain lots of data and you can extract from them subsets that you need to work on. This is a VERY important feature, one that is vital for reproducible research. Keep you data in one and only one place.