r_containers.rmd

# Data Containers {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_turtle.jpg"></div>

We almost never work with a single datum^[The word *data* is plural, datum is singular], rather we keep lots of data.  Moreover, the kinds of data are often heterogeneous, including categorical (Populations, Regions),  continuous (coordinates, rainfall, elevation), imagry (hyperspectral, LiDAR), and perhaps even genetic.  R has a very rich set of containers into which we can stuff our data as we work with it.  Here these container types are examined and the restrictions and benefits associated with each type are explained.

## Vectors

We have already seen several examples of several vectors in action (see the introduction to Numeric data types for example).  A vector of objects is simply a collection of them, often created using the `c()` function (*c* for combine).  Vectorized data is restricted to having homogeneous data types---you cannot mix character and numeric types in the same vector.  If you try to mix types, R will either coerce your data into a reasonable type

```{r}
x <- c(1,2,3)
x
y <- c(TRUE,TRUE,FALSE)
y
z <- c("I","am","not","a","looser")
z
```


or coearce them into one type that is amenable to all the types of data that you have given it.  In this example, a Logical, Character, Constant, and Function are combined resulting in a vector output of type Character.

```{r}
w <- c(TRUE, "1", pi, ls())
w
class(w)
```

Accessing elements within a vector are done using the square bracket `[]` notation.  All indices (for vectors and matrices) start at 1 (not zero as is the case for some languages).  Getting and setting the components within a vector are accomplished using numeric indices with the assignment operators just like we do for variables containing a single value.

```{r}
x
x[1] <- 2
x[3] <- 1
x
x[2]
```

A common type of vector is that of a sequences.  We use sequences all the time, to iterate through a list, to counting generations, etc.  There are a few ways to generate sequences, depending upon the step sequence.  For a sequence of whole numbers, the easiest is through the use of the colon operator.

```{r}
x <- 1:6
x
```

This provides a nice shorthand for getting the values X:Y from X to Y, inclusive.  It is also possible to go backwards using this operator, counting down from X to Y as in:

```{r}
x <- 5:2
x
```

The only constraint here is that we are limited to a step size of 1.0.  It is possible to use non-integers as the bounds, it will just count up by 1.0 each time.

```{r}
x <- 3.2:8.4
x
```

If you are interested in making a sequence with a step other than 1.0, you can use the `seq()` function.  If you do not provide a step value, it defaults to 1.0.

```{r}
y <- seq(1,6)
y
```

But if you do, it will use that instead.

```{r}
z <- seq(1,20,by=2)
z
```

It is also possible to create a vector of objects as repetitions using the `rep()` (for repeat) function.

```{r}
rep("Beetlejuice",3)
```

If you pass a vector of items to `rep()`, it can repeat these as either a vector being repeated (the default value)

```{r}
x <- c("No","Free","Lunch")
rep(x,time=3)
```

or as each item in the vector repeated.

```{r}
rep(x,each=3)
```


## Matrices 

A matrix is a 2- or higher dimensional container, most commonly used to store numeric data types.  There are some libraries that use matrices in more than two dimensions (rows and columns and sheets), though you will not run across them too often.  Here I restrict myself to only 2-dimensional matrices.

You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want.  The easiest matrix to try is one with empty values:

```{r}
matrix(nrow=2, ncol=2)
```

Perhaps more useful is one that is pre-populated with values.

```{r}
matrix(1:4, nrow=2 )
```

Notice that here, there were four entries and I only specified the number of rows required.  By default the ‘filling-in' of the matrix will proceed down column (*by-column*).  In this example, we have the first column with the first two entries and the last two entries down the second column.  If you want it to fill by row, you can pass the optional argument

```{r}
matrix(1:4, nrow=2, byrow=TRUE)
```

and it will fill *by-row*.

When filling matrices, the default size and the size of the data being added to the matrix are critical.  For example, I can create a matrix as:

```{r}
Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y
```

or 

```{r}
X <- matrix(c(1,2,3,4,5,6),nrow=2)
X
```

and both produce a similar matrix, only transposed.

```{r}
X == t(Y)
```

In the example above, the number of rows (or columns) was a clean multiple of the number of entries.  However, if it is not, R will fill in values.

```{r}
X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)
```

Notice how you get a warning from the interpreter.  But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.

```{r}
X
```

The dimensionality of a matrix (and `data.frame` as we will see shortly) is returned by the `dim()` function.  This will provide the number of rows and columns as a vector.

```{r}
dim(X)
```

Accessing elements to retrieve or set their values within a matrix is done using the square brackets just like for a vector but you need to give `[row,col]` indices.  Again, these are 1-based so that 

```{r}
X[1,3]
```

is the entry in the 1st row and 3rd column.

You can also use ‘slices' through a matrix to get the rows

```{r}
X[1,]
```

or columns

```{r}
X[,3]
```

of data. Here you just omit the index for the entity you want to span.  Notice that when you grab a slice, even if it is a column, is given as a vector.  

```{r}
length(X[,3])
```

You can grab a sub-matrix using slices if you give a range (or sequence) of indices.

```{r}
X[,2:3]
```

If you ask for values from a matrix that exceed its dimensions, R will give you an error.
 
```{r,eval=FALSE}
X[1,8]
```

```
## Error in X[1, 8] : subscript out of bounds
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```

There are a few cool extensions of the `rep()` function that can be used to create matrices as well.  They are optional values that can be passed to the function. 

* `times=x`: This is the default option that was occupied by the ‘3' in the example above and represents the number of times that first argument will be repeated.  
* `each=x` This will take each element in the first argument are repeat them `each` times.   
* `length.out=x`: This make the result equal in length to `x`.  

In combination, these can be quite helpful.  Here is an example using numeric sequences in which it is necessary to find the index of all entries in a 3x2 matrix.  To make the indices, I bind two columns together using `cbind()`.  There is a matching row binding function, denoted as `rbind()` (perhaps not so surprisingly).  What is returned is a matrix

```{r}
indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6)  )
indices
```


## Lists

A list is a type of vector but is indexed by ‘keys' rather than by numeric indices.  Moreover, lists can contain heterogeneous types of data (e.g., values of different `class`), which is not possible in a vector type.  For example, consider the list

```{r}
theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)
```

which is defined with a numeric, a character, and a logical component.  Each of these entries can be different in length as well as type.  Once defined, the entries may be observed as:

```{r}
theList
```

Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.

```{r}
theList$my_favoriate_number <- 2.9 + 3i
```

or use double brackets and the name of the variable as a character string.

```{r}
theList[["lotto numbers"]] <- rpois(7,lambda=42)
```

The keys currently in the list are given by the `names()` function

```{r}
names(theList)
```

Getting and setting values within a list are done the same way using either the `$`-operator

```{r}
theList$x
theList$x[2] <- 42
theList$x
```

or the double brackets

```{r}
theList[["x"]]
```

or using a numeric index, but that numeric index is looks to the results of `names()` to figure out which key to use.

```{r}
theList[[2]]
```

The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the `names()` function (*dog* in this case).  If you want to access elements within that variable, then you add a second set of brackets on after the double ones.  

```{r}
theList[[1]][3]
```

This deviates from the matrix approach as well as from how we access entries in a `data.frame` (described next).  It is not a single square bracket with two indices, that gives you an error:

```{r eval=FALSE}
theList[1,3] 
```

```
## Error in theList[1, 3] : incorrect number of dimensions
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```

List are rather robust objects that allow you to store a wide variety of data types (including nested lists).  Once you get the indexing scheme down, it they will provide nice solutions for many of your computational needs.

## Data Frames

The `data.frame` is the default data container in R.   It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, as well as a database.  If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data.  There may be `factors` representing designations such as species, regions, populations, sex, flower color, etc.  Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content.  You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.  

On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, `NA`, in each row.  Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined.  A `data.frame` is similar to this, at least conceptually.  You define a `data.frame` by designating the columns of data to be used.  You do not need to define all of them, more can be added later.  The values passed can be sequences, collections of values, or computed parameters.  For example:

```{r}
df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10))
df
```

You can see that each column is a unified type of data and each row is equivalent to a record.  Additional data columns may be added to an existing data.frame as:

```{r}
df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
```

Since we may have many (thousands?) of rows of observations, a `summary()` of the data.frame can provide a more compact description.

```{r}
summary(df)
```


We can add columns of data to the data.frame after the fact using the `$`-operator to indicate the column name. Depending upon the data type, the summary will provide an overview of what is there.

### Indexing Data Frames

You can access individual items within a `data.frame` by numeric index such as:

```{r}
df[1,3]
```

You can slide indices along rows (which return a new `data.frame` for you)

```{r}
df[1,]
```

or along columns (which give you a vector of data)

```{r}
df[,3]
```

or use the `$`-operator as you did for the list data type to get direct access to a either all the data or a specific subset therein.

```{r}
df$Names[3]
```

Indices are ordered just like for matrices, rows first then columns.  You can also pass a set of indices such as:

```{r}
df[1:3,]
```

It is also possible to use logical operators as indices.  Here I select only those names in the data.frame whose score was >90 and they passed popgen.

```{r}
df$Names[df$Score > 90 & df$Passed_Class==TRUE]
```


This is why `data.frame` objects are very database like.  They can contain lots of data and you can extract from them subsets that you need to work on.  This is a VERY important feature, one that is vital for reproducible research.  Keep you data in one and only one place.